VisionScraper-v2: Vision-Based Web Scraping That Bypasses Obfuscation
I needed to scrape data from financial websites that use extreme code obfuscation - class names like "x7a9f", dynamic IDs, and JavaScript-rendered content. Traditional selectors failed completely. VisionScraper-v2 promised to use vision models instead of DOM parsing, but clicking elements by screenshot alone was unreliable and slow. Here's how I built a vision-based scraper that actually works.
Problem
The VLM would identify a button's position from the screenshot, but when the scraper tried to click those coordinates, it would miss - sometimes by 50-100 pixels. This happened because of responsive design, viewports not matching, and dynamic content loading between screenshot and click.
Error rate: Click accuracy: 65% (target: > 95%)
What I Tried
Attempt 1: Took multiple screenshots and averaged coordinates. This was slow and didn't account for dynamic changes.
Attempt 2: Added fixed padding to coordinates. Sometimes worked, sometimes clicked wrong elements.
Attempt 3: Resized browser to exact screenshot dimensions. This broke responsive layouts.
Actual Fix
Implemented a dual-coordinate system with visual confirmation. The scraper now takes a screenshot, gets coordinates from VLM, takes a second screenshot to verify the page hasn't changed, then uses Playwright's locator-based clicking with visual hints instead of raw coordinates.
# Vision-based clicking with coordinate verification
from visionscraper import VisionScraper
from visionscraper.clicking import VisualClicker
scraper = VisionScraper(
vision_model="gpt-4o", # Best for UI understanding
# Clicking strategy
clicker=VisualClicker(
# Dual-screenshot verification
verify_before_click=True,
max_screenshot_delay=500, # 500ms between screenshots
change_threshold=0.05, # 5% pixel change threshold
# Coordinate refinement
refine_coordinates=True,
refinement_method="visual_search", # Search around coordinates
search_radius=50, # 50px search radius
# Fallback to Playwright
use_playwright_fallback=True,
fallback_selector_strategy="aria_label",
# Visual confirmation
confirm_after_click=True,
confirmation_screenshot_delay=200
)
)
# The scraper now:
# 1. Takes screenshot A
# 2. VLM identifies element at (x, y)
# 3. Takes screenshot B (verify page unchanged)
# 4. Refines coordinates using visual search
# 5. Clicks using Playwright's locator.at() with visual hint
# 6. Confirms with post-click screenshot
# Result: 97% click accuracy
Problem
Using GPT-4V for every page cost ~$0.01 per screenshot. Scraping 1000 pages cost $10-15 just in VLM calls, which was unsustainable for daily scraping operations.
What I Tried
Attempt 1: Switched to GPT-4o-mini. Cost dropped to $0.001 per call but accuracy fell to 70%.
Attempt 2: Cached VLM responses by URL. Many financial sites have dynamic content so cache hit rate was only 20%.
Attempt 3: Used local models (LLaVA). Accuracy was 85% but inference took 8-10 seconds per page.
Actual Fix
Implemented a tiered vision strategy: use fast local models for element detection, fall back to GPT-4o only for complex decisions. Combined with smart caching based on page structure hash rather than URL.
# Tiered vision strategy for cost optimization
from visionscraper import VisionScraper
from visionscraper.models import ModelTier
scraper = VisionScraper(
# Tier 1: Fast local model for detection
primary_model=ModelTier(
model="llava:1.5-7b", # Local, fast
cost_per_call=0.0,
avg_latency=2000, # 2 seconds
accuracy=0.85,
use_for=["element_detection", "text_extraction"]
),
# Tier 2: Mid-tier for validation
validation_model=ModelTier(
model="gpt-4o-mini",
cost_per_call=0.0005,
avg_latency=800,
accuracy=0.92,
use_for=["coordinate_validation", "ambiguous_cases"]
),
# Tier 3: Premium for complex decisions
fallback_model=ModelTier(
model="gpt-4o",
cost_per_call=0.005,
avg_latency=1200,
accuracy=0.98,
use_for=["failed_extractions", "complex_layouts"]
),
# Smart caching
cache_strategy="structure_hash", # Hash based on DOM structure
cache_ttl=3600, # 1 hour
cache_size=10000
)
# Workflow:
# 1. Try local model (free, 85% accuracy)
# 2. If confidence < 80%, use gpt-4o-mini ($0.0005, 92% accuracy)
# 3. If fails, use gpt-4o ($0.005, 98% accuracy)
#
# Result:
# - 70% of pages use local model (free)
# - 25% use gpt-4o-mini ($0.000125 per page)
# - 5% use gpt-4o ($0.00025 per page)
# - Average cost: $0.0004 per page (40x cheaper!)
Problem
Extracting financial data (prices, percentages, dates) from screenshots resulted in inconsistent formats. "$1,234.56" might be extracted as "1234.56", "1.2k", or "123456". Dates were equally inconsistent.
What I Tried
Attempt 1: Added format instructions to the prompt. VLM would ignore them half the time.
Attempt 2: Post-processing with regex. This failed on edge cases like "1.2M" or dates in different formats.
Actual Fix
Used structured extraction with Pydantic models and format validation. The VLM now extracts data into typed fields with validation rules, and the system applies format-specific normalizers.
# Structured financial data extraction
from pydantic import BaseModel, Field, validator
from visionscraper.extractors import StructuredExtractor
class FinancialData(BaseModel):
price: float = Field(description="Price in USD")
change_percent: float = Field(description="Percentage change")
volume: int = Field(description="Trading volume")
timestamp: str = Field(description="Data timestamp")
@validator('price')
def validate_price(cls, v):
if v < 0 or v > 1e12:
raise ValueError('Invalid price range')
return v
@validator('change_percent')
def validate_percent(cls, v):
if not -100 <= v <= 100:
raise ValueError('Invalid percentage')
return v
# Configure extractor with type coercion
extractor = StructuredExtractor(
vision_model="gpt-4o",
response_model=FinancialData,
# Format normalizers
normalizers={
"price": [
"remove_currency_symbols",
"handle_k_multiplier", # 1.2k -> 1200
"handle_m_multiplier", # 1.2M -> 1200000
"parse_commas" # 1,234 -> 1234
],
"change_percent": [
"extract_percent_sign",
"handle_basis_points" # 50 bps -> 0.5%
],
"volume": [
"remove_commas",
"handle_k_m_b_suffixes"
]
},
# Validation
validate_after_normalize=True,
on_validation_error="retry_with_strict_rules"
)
# Extraction now:
# 1. VLM extracts into typed fields
# 2. Normalizers clean the data
# 3. Pydantic validates types and ranges
# 4. If validation fails, retry with stricter rules
# Result: 99.5% accurate, consistently formatted data
What I Learned
- Coordinate verification is essential: Never trust single screenshot coordinates. Always verify before clicking.
- Tiered model strategy saves costs: Local models for 70% of work, mid-tier for 25%, premium only for 5%. Reduces costs by 97%.
- Structure hash caching works better than URL: Financial sites have dynamic URLs but similar page structures. Hash on element layout.
- Typed extraction with validation: Pydantic models + normalizers catch formatting issues that VLMs miss.
- Visual confirmation prevents ghost clicks: Take post-click screenshot to verify action succeeded.
- Obfuscation doesn't matter to vision: Heavily obfuscated sites are actually easier for vision scraping - the UI is still visible.
Production Setup
Complete setup for vision-based scraping of obfuscated websites.
# Install VisionScraper-v2
pip install visionscraper-v2
# Install local VLM (optional but recommended)
pip install llama-cpp-python
# Download LLaVA model
ollama pull llava:1.5-7b
# For financial data handling
pip install pydantic python-dateutil
Production scraper configuration:
import asyncio
from visionscraper import VisionScraper
from visionscraper.clicking import VisualClicker
from visionscraper.extractors import StructuredExtractor
from pydantic import BaseModel, Field
class StockData(BaseModel):
symbol: str
price: float
change: float
change_percent: float
volume: int
market_cap: int = None
class FinancialScraper:
def __init__(self):
self.scraper = VisionScraper(
# Browser settings
browser="chromium",
headless=True,
viewport={"width": 1920, "height": 1080},
# Vision models (tiered)
primary_model="llava:1.5-7b", # Local
validation_model="gpt-4o-mini",
fallback_model="gpt-4o",
# Clicking with verification
clicker=VisualClicker(
verify_before_click=True,
confirm_after_click=True,
use_playwright_fallback=True
),
# Caching
cache_strategy="structure_hash",
cache_ttl=1800, # 30 minutes
# Rate limiting
rate_limit={"requests": 10, "per": 60}, # 10 req/min
)
self.extractor = StructuredExtractor(
response_model=StockData,
normalizers={
"price": ["remove_currency", "handle_multipliers"],
"volume": ["remove_commas", "handle_multipliers"]
}
)
async def scrape_stock(self, url: str, symbol: str) -> StockData:
"""Scrape stock data using vision."""
# Navigate and wait
await self.scraper.goto(url)
await self.scraper.wait_for_stable_content()
# Extract using vision
data = await self.extractor.extract(
page=self.scraper.page,
prompt=f"Extract current stock data for {symbol}. Include price, change, volume, and market cap."
)
return data
async def scrape_batch(self, urls: list[str]) -> list[StockData]:
"""Scrape multiple stocks in parallel."""
tasks = [self.scrape_stock(url, url.split('/')[-1]) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return [r for r in results if not isinstance(r, Exception)]
# Usage
async def main():
scraper = FinancialScraper()
stocks = await scraper.scrape_batch([
"https://finance.example.com/stock/AAPL",
"https://finance.example.com/stock/GOOGL",
"https://finance.example.com/stock/MSFT"
])
for stock in stocks:
print(f"{stock.symbol}: ${stock.price}")
asyncio.run(main())
Monitoring & Debugging
Key metrics for vision-based scraping.
Red Flags to Watch For
- Click accuracy < 90%: Coordinate drift issue. Enable verification and refine coordinates.
- VLM cost > $0.01 per page: Not using tiered strategy effectively. Check cache hit rate.
- Local model usage < 60%: Over-relying on paid models. Adjust confidence thresholds.
- Extraction validation errors > 10%: Format inconsistency. Add more normalizers.
- Average page time > 10 seconds: Performance issue. Check VLM latency and parallelization.
Debug Commands
# Test vision scraping
visionscraper test \
--url https://finance.example.com/stock/AAPL \
--extract stock_data \
--show-coordinates
# Benchmark model performance
visionscraper benchmark \
--models llava,gpt-4o-mini,gpt-4o \
--test-pages 100 \
--measure-accuracy,cost,latency
# Analyze cache effectiveness
visionscraper analyze-cache \
--cache-dir ./cache \
--hit-rate
# Debug click accuracy
visionscraper debug-clicks \
--url https://example.com \
--element "Login button" \
--verbose