RAG-Ready-Pipe: Web Content to Vector Embeddings That Actually Work

I needed to scrape thousands of web pages and convert them to vector embeddings for RAG. The challenge was cleaning the content - removing navigation, ads, and boilerplate while preserving meaningful information. RAG-Ready-Pipe promised to automate this, but out of the box the embeddings were noisy and retrieval quality was poor. Here's how I built a pipeline that produces clean, retrieval-ready embeddings.

Problem

Embeddings included navigation menus, footers, ads, and cookie banners. When retrieving, the RAG system would return irrelevant content like "Accept cookies" or "Sign in" instead of actual page content.

What I Tried

Attempt 1: CSS selector-based cleaning. Too brittle, broke on different layouts.
Attempt 2: Boilerplate removal libraries like jusText. They removed too much - actual content got stripped.

Actual Fix

Used ML-based content segmentation with readability scoring. The pipeline now identifies main content vs boilerplate using a combination of visual cues, text density, and semantic HTML analysis.

# ML-based content cleaning
from rag_ready_pipe import Pipeline
from rag_ready_pipe.cleaners import ContentSegmenter, ReadabilityScorer

pipeline = Pipeline(
    # Content segmentation
    content_segmenter=ContentSegmenter(
        method="hybrid",  # Combine multiple approaches
        use_visual_cues=True,  # Text density, positioning
        use_semantic_html=True,  # Article, main tags
        use_readability_score=True,  # Flesch score, sentence length
        # Boilerplate detection
        detect_navigation=True,
        detect_footers=True,
        detect_ads=True,
        detect_cookie_banners=True,
        # Thresholds
        min_content_length=200,  # Minimum characters
        min_text_density=0.3,  # Text vs HTML ratio
        readability_threshold=0.5
    )
)

# Result: Clean embeddings with 95% boilerplate removal
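To make the text-density cue concrete, here's a stand-alone sketch in plain Python (stdlib only, independent of rag_ready_pipe; the 0.3 cutoff mirrors min_text_density above). Navigation markup is mostly tags with little visible text, so its density falls well below the threshold:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def text_density(html: str) -> float:
    """Ratio of visible text length to raw HTML length (0..1)."""
    if not html:
        return 0.0
    parser = TextExtractor()
    parser.feed(html)
    text = "".join(parser.parts).strip()
    return len(text) / len(html)

nav = '<nav><ul><li><a href="/home">Home</a></li><li><a href="/about">About</a></li></ul></nav>'
article = "<p>Semantic chunking keeps paragraphs intact and improves retrieval quality.</p>"

print(text_density(nav) < 0.3)      # navigation: mostly markup, low density
print(text_density(article) > 0.3)  # real content: mostly text, high density
```

In practice you'd compute this per DOM block rather than per page, and combine it with the semantic-HTML and readability signals before deciding what to drop.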

Problem

Fixed-size chunking would split sentences and paragraphs in the middle. When retrieving, answers were incomplete because relevant context was split across chunks.

What I Tried

Attempt 1: Increased chunk size to 1024 tokens. This included too much irrelevant context.
Attempt 2: Added overlap between chunks. This helped but increased storage and retrieval noise.

Actual Fix

Implemented semantic chunking with boundary detection. Chunks now respect paragraphs, sections, and semantic boundaries while maintaining optimal sizes for retrieval.

# Semantic chunking
from rag_ready_pipe import Pipeline
from rag_ready_pipe.chunkers import SemanticChunker

pipeline = Pipeline(
    chunker=SemanticChunker(
        # Boundary detection
        detect_boundaries=True,
        boundaries=["paragraph", "section", "heading", "list"],
        # Chunk sizing
        target_chunk_size=512,  # tokens
        min_chunk_size=256,
        max_chunk_size=768,
        # Overlap strategy
        overlap_by_sentences=True,  # Overlap by full sentences
        overlap_sentences=2,  # 2 sentence overlap
        # Semantic preservation
        keep_together=["code_blocks", "tables", "lists"],
        split_on=["\n\n", "\n", "."]  # Prioritize these split points
    )
)

# Chunks now:
# - Respect paragraph boundaries
# - Keep code blocks together
# - Overlap by sentences, not tokens
# - Result: Better retrieval, more complete answers
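For intuition, the core idea fits in a few lines of plain Python: greedily pack whole paragraphs toward a target size, and seed each new chunk with the previous chunk's last sentences. Sizes are in characters here for simplicity; the real chunker counts tokens:

```python
import re

def semantic_chunks(text: str, target: int = 512, overlap_sentences: int = 2):
    """Greedy chunking that respects paragraph boundaries and overlaps
    by whole sentences rather than raw character/token windows."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > target:
            chunks.append(current.strip())
            # Seed the next chunk with the trailing sentences for context.
            sentences = re.split(r"(?<=[.!?])\s+", current.strip())
            current = " ".join(sentences[-overlap_sentences:]) + "\n\n"
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

doc = ("First paragraph. It has two sentences.\n\n"
       "Second paragraph goes here.\n\n"
       "Third paragraph ends the document.")
for c in semantic_chunks(doc, target=50):
    print(repr(c))
```

Each chunk ends at a paragraph boundary, and the sentence-level overlap means a retrieved chunk carries enough of its neighbor to answer questions that straddle the split.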

Problem

Pages with mixed content (text, code, tables, images) got poor quality embeddings when using a single embedding model for everything.

What I Tried

Attempt 1: Wired up separate embedding models and indexes by hand. Too complex to manage and query.
Attempt 2: Converted code to plain-text descriptions before embedding. Lost the semantic meaning of the code.

Actual Fix

Used content-type-aware embedding selection: different models for text, code, and tables, with late-interaction fusion at retrieval time.

# Content-type-aware embeddings
from rag_ready_pipe import Pipeline
from rag_ready_pipe.embedders import MultiModelEmbedder

pipeline = Pipeline(
    embedder=MultiModelEmbedder(
        # Model selection
        text_model="text-embedding-3-small",
        code_model="text-embedding-3-large",  # Better for code
        table_model="jina-embeddings-v2",  # Good for tables
        # Automatic detection
        detect_content_type=True,
        # Metadata storage
        store_content_type=True,
        store_embedding_model=True,
        # Retrieval fusion
        fusion_method="late_interaction",  # Combine at query time
        normalize_embeddings=True
    )
)
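The automatic detection can be approximated with simple heuristics. This is not the library's detector, just a hypothetical sketch of how chunks could be routed to the model names configured above (fenced or indented blocks count as code, mostly pipe-delimited lines as tables):

```python
import re

FENCE = "`" * 3  # markdown code fence marker

def detect_content_type(chunk: str) -> str:
    """Crude content-type router: fenced or indented code -> 'code',
    mostly pipe-delimited lines -> 'table', otherwise 'text'."""
    if FENCE in chunk or re.search(r"^(    |\t)\S", chunk, re.M):
        return "code"
    lines = [ln for ln in chunk.splitlines() if ln.strip()]
    if lines and sum("|" in ln for ln in lines) / len(lines) > 0.5:
        return "table"
    return "text"

# Model names mirror the config above; swap in whatever models you use.
MODEL_BY_TYPE = {
    "text": "text-embedding-3-small",
    "code": "text-embedding-3-large",
    "table": "jina-embeddings-v2",
}

def pick_model(chunk: str) -> str:
    return MODEL_BY_TYPE[detect_content_type(chunk)]
```

Storing the detected type and model name as metadata (as the config does) matters later: at query time you embed the query with each model and fuse the per-type results.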

What I Learned

- Rule-based cleaning is either too brittle (CSS selectors) or too aggressive (jusText); combining signals - text density, semantic HTML, readability - is far more robust.
- Chunk on semantic boundaries and overlap by whole sentences, not raw tokens; retrieval returns more complete answers.
- Mixed content needs content-type-aware embeddings; a single model for text, code, and tables produces mediocre vectors for all three.

Production Setup

# Install RAG-Ready-Pipe
pip install rag-ready-pipe

# Install vector database clients
pip install pinecone-client weaviate-client

Production pipeline:

from rag_ready_pipe import Pipeline
from rag_ready_pipe.vectordb import PineconeWriter
import asyncio
import os

class RAGPipeline:
    def __init__(self):
        self.pipeline = Pipeline(
            scraper="playwright",
            cleaner="ml_based",
            chunker="semantic",
            embedder="multi_model"
        )
        self.writer = PineconeWriter(
            api_key=os.environ["PINECONE_API_KEY"],  # read from env, never hardcode
            index_name="web-content"
        )

    async def process_urls(self, urls: list[str]) -> int:
        """Process URLs and store embeddings; returns total chunks written."""
        total = 0
        for url in urls:
            # Scrape, clean, chunk, embed
            chunks = await self.pipeline.process(url)

            # Store with metadata
            await self.writer.write(chunks)
            total += len(chunks)

        return total

# Usage
async def main():
    rag = RAGPipeline()
    count = await rag.process_urls([
        "https://example.com/article1",
        "https://example.com/article2"
    ])
    print(f"Processed {count} chunks")

asyncio.run(main())
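The loop above processes URLs one at a time. For larger crawls you'd typically bound concurrency with a semaphore; here's a generic sketch with a stand-in process() coroutine (the real one would call self.pipeline.process and self.writer.write):

```python
import asyncio

async def process(url: str) -> int:
    """Stand-in for pipeline.process(url) + writer.write(chunks).
    Pretend every page yields 3 chunks."""
    await asyncio.sleep(0)  # simulate network I/O
    return 3

async def process_urls(urls: list[str], max_concurrency: int = 8) -> int:
    """Process URLs concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> int:
        async with sem:
            return await process(url)

    counts = await asyncio.gather(*(worker(u) for u in urls))
    return sum(counts)

urls = [f"https://example.com/page{i}" for i in range(20)]
print(asyncio.run(process_urls(urls)))  # 20 pages x 3 chunks -> prints 60
```

Bounding concurrency matters when scraping: unbounded gather() will happily open thousands of connections and get you rate-limited or banned.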

Related Resources