Embedchain: Finally Built a Chatbot That Knows My Stuff

Wanted an AI that answers questions using my documents. Embedchain made it actually simple. Open source, Python, works out of the box. Here's how I got it running.

The problem I was trying to solve

Had tons of documentation scattered everywhere - Google Docs, Notion, PDFs, random markdown files. Every time I needed to find something, I'd search through everything manually. Annoying.

Wanted a chatbot that knew all this stuff. Ask it a question, it searches through everything, gives an answer with sources. Seemed simple enough.

Looked into building it myself. LangChain? Too complex. Custom RAG implementation? Days of work. Then found Embedchain. Promised to handle all the messy parts.

What I actually built

Upload your stuff → Embedchain handles chunking, embeddings, vector storage → Ask questions → Get answers with citations. Took like 2 hours.

Now I just ask "what did we decide about the API rate limits?" and it tells me, with links to the exact documents.

So what is Embedchain

Embedchain is a Python framework for building RAG (Retrieval Augmented Generation) apps. Basically, it lets you create AI chatbots that know about your specific data.

What it does:

  • Ingests data: PDFs, docs, websites, YouTube, text files - whatever you have
  • Chunks it: Breaks documents into pieces that LLMs can handle
  • Embeds it: Converts to vectors for semantic search
  • Stores it: Uses a vector database (local or cloud)
  • Retrieves: Finds relevant chunks when you ask questions
  • Answers: Uses GPT/Claude to generate responses from retrieved context
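The pipeline those bullets describe can be sketched in plain Python. This toy version is not Embedchain's internals: it uses bag-of-words counts instead of real embeddings and stops at retrieval (a real RAG app would hand the retrieved chunk plus the question to an LLM), but it shows the chunk → embed → store → retrieve idea.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector (real systems use dense vectors).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Ingest": store (chunk, vector) pairs.
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "The API rate limit is 100 requests per minute.",
]
store = [(c, embed(c)) for c in chunks]

def retrieve(question: str) -> str:
    # Return the most similar chunk; a real app would now pass
    # this chunk and the question to the LLM for answer generation.
    q = embed(question)
    return max(store, key=lambda pair: cosine(q, pair[1]))[0]

print(retrieve("what is the api rate limit?"))
# -> The API rate limit is 100 requests per minute.
```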

Why it's easier than rolling your own:

  • One-line data ingestion
  • Handles different file types automatically
  • Built-in chunking strategies
  • Works with any LLM (OpenAI, Claude, local models)
  • Supports multiple vector databases
  • Simple Python API

It's open source, so you can self-host everything. No sending your data to third-party services if you don't want to.

Setting it up

Install

Pretty straightforward:

pip install embedchain

Get API keys

You'll need at least one LLM API key:

# For OpenAI (GPT-4, GPT-3.5)
export OPENAI_API_KEY="sk-..."

# Or for Anthropic (Claude)
export ANTHROPIC_API_KEY="sk-ant-..."

For vector database, you can use:

  • ChromaDB: Local, free, no setup needed (easiest to start)
  • Pinecone: Cloud, better for scale (free tier available)
  • Weaviate: Self-hosted option

I started with ChromaDB since it's local and free. Moved to Pinecone once I had more data.

Building your first bot

Simple example

Here's the bare minimum to get started:

from embedchain import App

# Create app (uses ChromaDB locally by default)
app = App()

# Add some data
app.add("https://en.wikipedia.org/wiki/Artificial_intelligence")
app.add_local("path/to/your/document.pdf")

# Ask questions
result = app.query("What is artificial intelligence?")
print(result)

With configuration

For more control over how it works:

from embedchain import App
from embedchain.config import AppConfig, ChunkerConfig

# Configure behavior
config = AppConfig(
    log_level="INFO",
    collect_metrics=False  # disable telemetry if you want
)

# Configure chunking (how docs are split)
chunker_config = ChunkerConfig(
    chunk_size=500,        # size of each chunk
    chunk_overlap=50,      # overlap between chunks
    length_function="len"  # how to measure size
)

# Create app with config
app = App(config=config, chunker_config=chunker_config)

# Now add data and query
app.add("path/to/docs")
response = app.query("What's in those docs?")

Different data sources

Embedchain handles various types:

# Web pages
app.add("https://example.com/page")

# Local files
app.add_local("docs/report.pdf")
app.add_local("notes.txt")
app.add_local("data/docs")  # whole directory

# YouTube videos
app.add("https://youtube.com/watch?v=xxx")

# Direct text
app.add({"content": "Some text you want to add", "meta_data": {"source": "manual"}})

# Q&A pairs
app.add({"question": "What's the refund policy?", "answer": "30 days, no questions asked"})

Querying with context

Get citations and see what it retrieved:

# Get response with sources
response = app.query(
    "What's our refund policy?",
    citations=True  # include where the answer came from
)

print(response.answer)
print(response.sources)  # shows which docs were used

# Query with specific context
response = app.query(
    "Compare our pricing with competitors",
    where={"metadata": {"category": "pricing"}}  # filter by metadata
)
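Under the hood, a metadata filter just narrows the candidate set before similarity search. Here's a rough sketch of that filtering step (my own illustration, not Embedchain's code; the exact `where` syntax can vary between versions, so check the docs for yours):

```python
def filter_by_metadata(chunks: list[dict], where: dict) -> list[dict]:
    # Keep only chunks whose metadata matches every key/value pair in `where`.
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in where.items())
    ]

chunks = [
    {"text": "Pro plan is $49/month", "metadata": {"category": "pricing"}},
    {"text": "Reset passwords via email", "metadata": {"category": "support"}},
]

hits = filter_by_metadata(chunks, {"category": "pricing"})
print([c["text"] for c in hits])  # -> ['Pro plan is $49/month']
```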

Stuff I've built with it

Company knowledge bot

Fed it all our internal docs - policies, procedures, meeting notes. Now the team asks it questions instead of bugging each other.

# Added company wiki
app.add_local("company_docs/wiki")
app.add_local("company_docs/policies")
app.add_local("company_docs/meeting_notes")

# Now works like this
query = "What's our work from home policy?"
answer = app.query(query)
# Returns: "Employees can work from home up to 3 days per week..."

Research assistant

Upload research papers, ask questions about them. Way faster than reading everything.

import os

# Load papers
for paper in os.listdir("research_papers"):
    app.add_local(f"research_papers/{paper}")

# Now can ask
app.query("What are the main findings about transformer attention?")
app.query("Compare the approaches in these three papers")

Customer support bot

Fed it product docs, FAQs, past support tickets (retrieval, not actual model training). Handles common questions automatically.

# Product documentation
app.add_local("docs/user_guide.pdf")
app.add_local("docs/api_reference.pdf")
app.add_local("support/faq.md")
app.add_local("support/common_issues.md")

# Deploy as API (FastAPI example)
from fastapi import FastAPI
api = FastAPI()

@api.post("/chat")
async def chat(question: str):
    answer = app.query(question)
    return {"response": answer}

Personal notes search

All my notes, bookmarks, articles - now searchable by meaning, not just keywords.

# My digital brain
app.add_local("notes/obsidian_vault")
app.add_local("bookmarks/bookmarks.html")
app.add("https://my-blog.com")  # my own writing

# Now I can ask
app.query("What did I write about async programming?")
app.query("Find articles about Rust I saved")

Getting better results

Choosing chunk size

Chunk size affects retrieval quality:

# Smaller chunks = more precise but less context
chunker_config = ChunkerConfig(
    chunk_size=200,     # small chunks
    chunk_overlap=20
)

# Larger chunks = more context but less precise
chunker_config = ChunkerConfig(
    chunk_size=1000,    # big chunks
    chunk_overlap=100
)

# I found 500-800 works well for most docs
# Adjust based on your content
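To see what chunk_size and chunk_overlap actually do, here's a minimal character-based sliding-window chunker (my own sketch; Embedchain's real chunkers are token-aware and smarter about boundaries):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Slide a fixed-size window; each step advances chunk_size - chunk_overlap
    # characters, so neighboring chunks share chunk_overlap characters of context.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "abcdefghij" * 5  # 50 characters
chunks = chunk_text(doc, chunk_size=20, chunk_overlap=5)
print(len(chunks))                    # -> 4
print(chunks[0][-5:], chunks[1][:5])  # both "fghij": the shared overlap
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one side.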

Using better LLMs

GPT-4 is smarter but costs more:

# Use GPT-4 for complex queries
from embedchain.llm.gpt4 import GPT4Llm

llm = GPT4Llm(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4",
    temperature=0.3  # lower = more focused
)

app = App(llm=llm)

# Or use Claude (sometimes better for long docs)
from embedchain.llm.anthropic import AnthropicLlm

llm = AnthropicLlm(
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    model="claude-2"
)

Custom instructions

Tell the bot how to behave:

# Add system prompt
config = AppConfig(
    system_prompt="""You are a helpful assistant for ACME Corp.
    Answer questions based only on the provided context.
    If you don't know, say so - don't make things up.
    Be concise but thorough."""
)

app = App(config=config)

Hybrid search

Combine semantic and keyword search:

from embedchain.vectordb.chroma import ChromaDB

# Enable hybrid search
vectordb_config = {
    "hybrid_search": True,  # semantic + keyword
    "bm25_weight": 0.3,     # keyword weight
    "semantic_weight": 0.7  # semantic weight
}

app = App(vectordb_config=vectordb_config)
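The weighted blend works roughly like this (a sketch of the scoring idea, not ChromaDB's actual implementation; it assumes both scores are already normalized to 0-1):

```python
def hybrid_score(bm25: float, semantic: float,
                 bm25_weight: float = 0.3, semantic_weight: float = 0.7) -> float:
    # Blend a keyword (BM25) score and a semantic similarity score,
    # both assumed normalized to the 0-1 range, using the configured weights.
    return bm25_weight * bm25 + semantic_weight * semantic

# A doc with strong exact-keyword matches but weak semantic similarity...
print(round(hybrid_score(bm25=0.9, semantic=0.2), 2))  # -> 0.41
# ...can be outranked by a semantically strong paraphrase.
print(round(hybrid_score(bm25=0.1, semantic=0.8), 2))  # -> 0.59
```

Shifting bm25_weight up favors exact terms (good for IDs, error codes); shifting semantic_weight up favors paraphrases.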

Deploying it

Simple web UI

Embedchain comes with a built-in UI:

from embedchain import App

app = App()

# Start web interface
app.run()  # opens at http://localhost:5000

FastAPI backend

For production deployments:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app_api = FastAPI()
chatbot = App()

class Query(BaseModel):
    question: str

@app_api.post("/chat")
async def chat(query: Query):
    try:
        response = chatbot.query(query.question)
        return {"answer": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Run with: uvicorn main:app_api --reload

Docker deployment

# Dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
CMD ["python", "app.py"]

# Run
docker build -t embedchain-bot .
docker run -p 5000:5000 -e OPENAI_API_KEY=xxx embedchain-bot

Stuff that didn't work

Poor retrieval

Answers were irrelevant or missing info.

# Fixed by
1. Adjusting chunk size (500 → 800)
2. Increasing overlap (50 → 100)
3. Enabling hybrid search
4. Adding more relevant documents

Slow ingestion

Large PDFs took forever to process.

# Fixed by
1. Using async ingestion
2. Processing in batches
3. Switching to faster vector DB (Pinecone)
4. Pre-splitting huge documents
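The batching fix is nothing fancy: group the file paths and ingest a handful at a time instead of everything in one pass. A sketch (the `app.add_local` calls in the comment are the ingestion step from earlier, left commented out):

```python
def batches(paths: list[str], size: int) -> list[list[str]]:
    # Split a list of file paths into fixed-size groups.
    return [paths[i:i + size] for i in range(0, len(paths), size)]

files = [f"doc_{n}.pdf" for n in range(10)]
for batch in batches(files, size=4):
    # Ingest one small batch at a time, e.g.:
    # for path in batch:
    #     app.add_local(path)
    print(len(batch))  # -> 4, 4, 2
```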

Hallucinations

Making stuff up that wasn't in the docs.

# Fixed by
1. Adding "don't make things up" to system prompt
2. Lowering temperature (0.7 → 0.3)
3. Requiring citations (shows sources)
4. Using GPT-4 instead of 3.5

Memory issues

ChromaDB ate all my RAM with lots of docs.

# Fixed by
1. Switching to Pinecone (cloud-based)
2. Purging old embeddings
3. Limiting document size before ingestion

Embedchain vs alternatives

                 Embedchain       LangChain           LlamaIndex
Learning curve   Easy             Steep               Moderate
Setup time       Minutes          Hours               Hours
Code required    Minimal          Lots                Moderate
Data sources     Built-in         Manual loaders      Manual loaders
Chunking         Auto             Manual              Auto (advanced)
Customization    Good             Excellent           Excellent
Best for         Quick RAG apps   Complex workflows   Advanced indexing

Embedchain is the fastest way to get a RAG app running. LangChain and LlamaIndex are more powerful if you need custom stuff.

Would I recommend it?

Yeah, if you need to build a chatbot on your own data and don't want to spend weeks on it. Embedchain handles all the annoying parts - chunking, embeddings, retrieval, storage.

I built our company knowledge bot in an afternoon. It's not perfect - sometimes retrieves irrelevant chunks, occasionally hallucinates. But for 80% of queries, it works great.

The simple API is what won me over. Add data, query data. That's it. Didn't have to learn a whole framework or understand vector databases deeply.

If you outgrow it, you can always move to LangChain. But for most use cases, Embedchain is plenty.

Links: github.com/embedchain/embedchain | Docs: docs.embedchain.ai