RAG-Ready-Pipe: Web Content to Vector Embeddings That Actually Work
I needed to scrape thousands of web pages and convert them to vector embeddings for RAG. The challenge was cleaning the content - removing navigation, ads, and boilerplate while preserving meaningful information. RAG-Ready-Pipe promised to automate this, but out of the box the embeddings were noisy and retrieval quality was poor. Here's how I configured the pipeline to produce clean, retrieval-ready embeddings.
Problem
Embeddings included navigation menus, footers, ads, and cookie banners. When retrieving, the RAG system would return irrelevant content like "Accept cookies" or "Sign in" instead of actual page content.
What I Tried
Attempt 1: CSS selector-based cleaning. Too brittle, broke on different layouts.
Attempt 2: Boilerplate removal libraries like jusText. They removed too much - actual article content got stripped along with the boilerplate.
Actual Fix
Used ML-based content segmentation with readability scoring. The pipeline now identifies main content vs boilerplate using a combination of visual cues, text density, and semantic HTML analysis.
# ML-based content cleaning
from rag_ready_pipe import Pipeline
from rag_ready_pipe.cleaners import ContentSegmenter, ReadabilityScorer
pipeline = Pipeline(
    # Content segmentation
    content_segmenter=ContentSegmenter(
        method="hybrid",                 # Combine multiple approaches
        use_visual_cues=True,            # Text density, positioning
        use_semantic_html=True,          # <article>, <main> tags
        use_readability_score=True,      # Flesch score, sentence length
        # Boilerplate detection
        detect_navigation=True,
        detect_footers=True,
        detect_ads=True,
        detect_cookie_banners=True,
        # Thresholds
        min_content_length=200,          # Minimum characters
        min_text_density=0.3,            # Text vs HTML ratio
        readability_threshold=0.5
    )
)
# Result: Clean embeddings with 95% boilerplate removal
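The text-density threshold is the easiest of these signals to reason about. As a rough illustration (not the library's actual implementation), a block's density is the ratio of visible text to raw HTML length, and markup-heavy blocks like nav bars score far below a 0.3 cutoff:

```python
import re

def text_density(html_block: str) -> float:
    """Ratio of visible text length to raw HTML length (0.0-1.0)."""
    if not html_block:
        return 0.0
    visible = re.sub(r"<[^>]+>", "", html_block)    # strip tags
    visible = re.sub(r"\s+", " ", visible).strip()  # collapse whitespace
    return len(visible) / len(html_block)

def looks_like_boilerplate(html_block: str, min_density: float = 0.3) -> bool:
    """Flag markup-heavy blocks (nav bars, footers) as likely boilerplate."""
    return text_density(html_block) < min_density

nav = '<nav><ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul></nav>'
article = "<p>Semantic HTML and text density together give a robust signal for main-content detection across most page layouts.</p>"
```

A real cleaner combines this with the other cues, since short but legitimate content (captions, pull quotes) can also be markup-heavy.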
Problem
Fixed-size chunking would split sentences and paragraphs in the middle. When retrieving, answers were incomplete because relevant context was split across chunks.
What I Tried
Attempt 1: Increased chunk size to 1024 tokens. This included too much irrelevant context.
Attempt 2: Added overlap between chunks. This helped but increased storage and retrieval noise.
Actual Fix
Implemented semantic chunking with boundary detection. Chunks now respect paragraphs, sections, and semantic boundaries while maintaining optimal sizes for retrieval.
# Semantic chunking
from rag_ready_pipe import Pipeline
from rag_ready_pipe.chunkers import SemanticChunker
pipeline = Pipeline(
    chunker=SemanticChunker(
        # Boundary detection
        detect_boundaries=True,
        boundaries=["paragraph", "section", "heading", "list"],
        # Chunk sizing
        target_chunk_size=512,        # tokens
        min_chunk_size=256,
        max_chunk_size=768,
        # Overlap strategy
        overlap_by_sentences=True,    # Overlap by full sentences
        overlap_sentences=2,          # 2-sentence overlap
        # Semantic preservation
        keep_together=["code_blocks", "tables", "lists"],
        split_on=["\n\n", "\n", "."]  # Prioritized split points
    )
)
# Chunks now:
# - Respect paragraph boundaries
# - Keep code blocks together
# - Overlap by sentences, not tokens
# - Result: Better retrieval, more complete answers
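To see why sentence-level overlap beats token-level overlap, here is a minimal sketch of the idea (illustrative only; SemanticChunker's internals are more involved). Chunks fill up on paragraph boundaries, and each new chunk starts with the last sentences of the previous one, so no sentence is ever cut mid-way:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_with_sentence_overlap(text: str, target_size: int = 512,
                                overlap_sentences: int = 2) -> list[str]:
    """Greedy chunker: fill to target_size chars on paragraph boundaries,
    then carry the last N full sentences forward as overlap."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > target_size:
            chunks.append(current.strip())
            tail = split_sentences(current)[-overlap_sentences:]
            current = " ".join(tail) + "\n\n"
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

text = "First sentence. Second sentence.\n\nThird sentence. Fourth sentence.\n\nFifth sentence."
chunks = chunk_with_sentence_overlap(text, target_size=40, overlap_sentences=1)
```

With a 40-character target, each chunk here opens with the closing sentence of the one before it, so a retrieved chunk always carries the lead-in context for its first paragraph.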
Problem
Pages with mixed content (text, code, tables, images) got poor quality embeddings when using a single embedding model for everything.
What I Tried
Attempt 1: Manually routed content to separate embedding models and indexes. Too complex to manage and query.
Attempt 2: Preprocessed code to text only. Lost semantic meaning.
Actual Fix
Used content-type-aware embedding selection. Different models for text, code, tables, with late interaction fusion for retrieval.
# Content-type-aware embeddings
from rag_ready_pipe import Pipeline
from rag_ready_pipe.embedders import MultiModelEmbedder
pipeline = Pipeline(
    embedder=MultiModelEmbedder(
        # Model selection
        text_model="text-embedding-3-small",
        code_model="text-embedding-3-large",  # Better for code
        table_model="jina-embeddings-v2",     # Good for tables
        # Automatic detection
        detect_content_type=True,
        # Metadata storage
        store_content_type=True,
        store_embedding_model=True,
        # Retrieval fusion
        fusion_method="late_interaction",     # Combine at query time
        normalize_embeddings=True
    )
)
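The fusion step is the key design choice: instead of averaging heterogeneous embeddings at index time, each chunk keeps one vector per content type, and the query is scored against all of them at query time, taking the best match. A minimal sketch of that "combine at query time" idea (hypothetical data; true late interaction, as in ColBERT, operates at the token level):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def late_fusion_score(query_vec: list[float],
                      chunk_vecs: dict[str, list[float]]) -> float:
    """Score a chunk by its best-matching content-type embedding,
    so a code-heavy query matches the code vector, not a diluted average."""
    return max(cosine(query_vec, v) for v in chunk_vecs.values())
```

This is why storing the content type and embedding model as metadata matters: at query time you need to know which vector came from which model to score them comparably.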
What I Learned
- Readability scoring beats selector-based cleaning: More robust across different layouts.
- Semantic chunking is essential for quality: Respect boundaries, don't just count tokens.
- Different content needs different models: Code needs different embeddings than prose.
- Metadata is crucial for retrieval: Store content type, source, and timestamp with embeddings.
- Quality metrics prevent bad data: Monitor retrieval quality and adjust pipeline accordingly.
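On the last point: the simplest quality metric to start with is recall@k over a small hand-labeled evaluation set of queries and their relevant chunk IDs. A sketch of the metric (the evaluation harness around it is up to you):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)
```

Tracking this number before and after a pipeline change (cleaning thresholds, chunk sizes, model selection) is what tells you whether the change actually helped retrieval.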
Production Setup
# Install RAG-Ready-Pipe
pip install rag-ready-pipe
# Install vector database clients
pip install pinecone-client weaviate-client
Production pipeline:
from rag_ready_pipe import Pipeline
from rag_ready_pipe.vectordb import PineconeWriter
import asyncio

class RAGPipeline:
    def __init__(self):
        self.pipeline = Pipeline(
            scraper="playwright",
            cleaner="ml_based",
            chunker="semantic",
            embedder="multi_model"
        )
        self.writer = PineconeWriter(
            api_key="your-key",
            index_name="web-content"
        )

    async def process_urls(self, urls: list[str]) -> int:
        """Process URLs and store embeddings. Returns total chunk count."""
        total = 0
        for url in urls:
            # Scrape, clean, chunk, embed
            chunks = await self.pipeline.process(url)
            # Store with metadata
            await self.writer.write(chunks)
            total += len(chunks)
        return total

# Usage
async def main():
    rag = RAGPipeline()
    count = await rag.process_urls([
        "https://example.com/article1",
        "https://example.com/article2"
    ])
    print(f"Processed {count} chunks")

asyncio.run(main())