Scrapegraph-ai: AI-Powered Web Scraping

Using LLMs to extract structured data from websites. My notes and experiments.

What Is Scrapegraph-ai?

Scrapegraph-ai is a Python library that uses LLMs to scrape websites. Instead of writing CSS selectors or XPath queries, you tell the AI what data you want and it figures out how to extract it.

The idea is pretty cool: you give it a URL and a prompt like "extract all product names and prices", and it uses GPT-4 (or other models) to understand the page structure and pull out the relevant data.

I've been experimenting with it for a project where the target sites have inconsistent HTML structures. Traditional scraping would require writing separate parsers for each site. With Scrapegraph-ai, the AI handles the structure variations automatically.

How It Works

Under the hood, it uses a graph-based approach:

  • Fetches the page - Uses Playwright to render JavaScript
  • Parses HTML - Converts to a format the LLM can understand
  • LLM extraction - Sends the content to an LLM with your prompt
  • Structured output - Returns JSON with the extracted data
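The same flow as a plain-Python sketch, with stub functions standing in for the real nodes (this is not scrapegraphai's actual API, just the shape of the pipeline):

```python
import re

# Stub pipeline mirroring the four stages. Each function is a stand-in
# for the corresponding graph node, not the library's real implementation.

def fetch(url: str) -> str:
    # Playwright would render the page here; we return canned HTML.
    return "<html><body><h1>Widget</h1><p>$9.99</p></body></html>"

def parse(html: str) -> str:
    # Convert HTML into LLM-friendly text (the real node does smarter cleanup).
    return re.sub(r"<[^>]+>", " ", html).strip()

def llm_extract(text: str, prompt: str) -> dict:
    # Stand-in for the LLM call: pretend it pulled out name and price.
    return {"name": "Widget", "price": "$9.99"}

def scrape(url: str, prompt: str) -> dict:
    return llm_extract(parse(fetch(url)), prompt)

print(scrape("https://example.com", "extract name and price"))
```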

Why Use It?

  • No manual selectors: Don't need to inspect HTML and write CSS selectors
  • Handles dynamic content: Works with JavaScript-rendered pages
  • Flexible output: Can extract any data structure you describe
  • Resilient to changes: If the site layout changes, the AI usually adapts without code changes

The Downsides

It's not magic. There are some real limitations:

  • Cost: Each scrape uses LLM tokens, which adds up
  • Speed: Slower than traditional scraping (LLM API calls take time)
  • Reliability: Sometimes hallucinates data or misses things
  • Blocking: Still subject to the same rate limits and anti-scraping measures as any scraper

Installation

Install with pip:

pip install scrapegraphai

You'll also need an OpenAI API key (or whatever LLM you're using):

export OPENAI_API_KEY="your-key-here"

Or set it in your Python code:

import os
os.environ["OPENAI_API_KEY"] = "your-key-here"

It uses Playwright under the hood, so you might need to install browsers:

playwright install chromium

Basic Usage

SmartScraperGraph

The simplest way to get started:

import os

from scrapegraphai.graphs import SmartScraperGraph

# Define what you want to extract
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": True,
}

# Create the graph
smart_scraper = SmartScraperGraph(
    prompt="List all product names, prices, and descriptions",
    source="https://example-shop.com/products",
    config=graph_config
)

# Run it
result = smart_scraper.run()
print(result)

The output will be JSON like:

{
  "products": [
    {
      "name": "Product A",
      "price": "$29.99",
      "description": "A great product"
    },
    {
      "name": "Product B",
      "price": "$49.99",
      "description": "Even better"
    }
  ]
}

Pretty clean. The AI figured out the structure on its own.

SearchGraph

If you need to search first:

from scrapegraphai.graphs import SearchGraph

search_graph = SearchGraph(
    prompt="Extract the top 5 results for 'python tutorials'",
    config=graph_config
)

result = search_graph.run()
print(result)

This does a Google search first, then scrapes the results.

DeepScraperGraph

For multi-page scraping (follows links):

from scrapegraphai.graphs import DeepScraperGraph

deep_scraper = DeepScraperGraph(
    prompt="Extract all blog post titles and links from this page and linked pages",
    source="https://example.com/blog",
    config=graph_config,
    depth=2  # How many levels deep to follow
)

result = deep_scraper.run()
print(result)

Be careful with depth - it can get expensive fast with LLM calls.
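To see why, here's a back-of-envelope sketch. The branching factor and per-page token cost are made-up assumptions for illustration:

```python
# Rough cost model for multi-page scraping: if each page links to roughly
# `branching` new pages, total LLM calls grow geometrically with depth.

def pages_scraped(branching: int, depth: int) -> int:
    # 1 root page + branching + branching^2 + ... down `depth` levels
    return sum(branching ** level for level in range(depth + 1))

calls = pages_scraped(branching=10, depth=2)   # 1 + 10 + 100 = 111 pages
cost = calls * 0.02                            # assume ~$0.02 of tokens per page
print(calls, round(cost, 2))
```

At depth 3 with the same branching, that jumps to 1,111 pages, so cost grows by roughly an order of magnitude per level.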

Building Custom Graphs

The real power is creating custom graphs for specific workflows:

import os

from scrapegraphai.nodes import (
    FetchNode,
    ParseNode,
    GenerateAnswerNode
)
from scrapegraphai.graphs import BaseGraph

# Define custom nodes
fetch_node = FetchNode(
    node_name="Fetch",
    node_params={
        "loader_kwargs": {
            "headless": True
        }
    }
)

parse_node = ParseNode(
    node_name="Parse",
    node_params={}
)

gen_answer_node = GenerateAnswerNode(
    node_name="GenerateAnswer",
    node_params={
        "llm_config": {
            "api_key": os.getenv("OPENAI_API_KEY"),
            "model": "openai/gpt-4o"
        }
    }
)

# Connect nodes
custom_graph = BaseGraph(
    nodes=[
        fetch_node,
        parse_node,
        gen_answer_node
    ],
    edges=[
        ("Fetch", "Parse"),
        ("Parse", "GenerateAnswer")
    ]
)

result = custom_graph.run({
    "url": "https://example.com",
    "user_prompt": "Extract all article titles and dates"
})

print(result)

This lets you add custom logic between fetching and extracting. Useful for things like filtering, validation, or post-processing.
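As a concrete example of the kind of post-processing you might slot in, here's a plain-Python filter that drops incomplete items. It's a library-agnostic sketch, not a scrapegraphai node:

```python
# Post-processing sketch: drop extracted items missing required fields.
# Plain Python, independent of any scrapegraphai node API.

def filter_complete(items, required=("name", "price")):
    return [item for item in items if all(item.get(key) for key in required)]

raw = [
    {"name": "Product A", "price": "$29.99"},
    {"name": "Product B"},  # missing price: gets dropped
]
print(filter_complete(raw))  # only Product A survives
```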

Using Different LLMs

Not limited to OpenAI. Works with other providers:

Ollama (Local Models)

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": True,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract product data",
    source="https://example.com",
    config=graph_config
)

This is great for cost savings. No API fees, just run it locally. The downside is smaller models might not extract as accurately.

Groq

graph_config = {
    "llm": {
        "api_key": os.getenv("GROQ_API_KEY"),
        "model": "groq/llama3-70b-8192",
    },
    "verbose": True,
    "headless": True,
}

Groq serves open models with very fast inference at low cost. Good for experimentation.

Azure OpenAI

graph_config = {
    "llm": {
        "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
        "model": "azure/openai/gpt-4",
        "api_base": "https://your-resource.openai.azure.com",
    },
    "verbose": True,
    "headless": True,
}

Cost Optimization

This is the big one. LLM scraping gets expensive fast. Here's what I learned:

1. Use Cheaper Models for Simple Tasks

# For simple extraction
graph_config = {
    "llm": {
        "model": "openai/gpt-3.5-turbo",  # Cheaper than GPT-4
    }
}

# For complex extraction
graph_config = {
    "llm": {
        "model": "openai/gpt-4o",  # More accurate
    }
}
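One way to automate that choice is a tiny router. The thresholds below are made up for illustration, not tuned values:

```python
# Hypothetical model router: cheap model for small, simple pages; the
# stronger model for big pages or prompts that need real reasoning.

def pick_model(page_chars: int, needs_reasoning: bool) -> str:
    if needs_reasoning or page_chars > 50_000:
        return "openai/gpt-4o"
    return "openai/gpt-3.5-turbo"

print(pick_model(8_000, needs_reasoning=False))   # cheap model is enough
print(pick_model(8_000, needs_reasoning=True))    # escalate to the stronger model
```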

2. Cache Results

import hashlib
import json

def get_cache_key(url, prompt):
    return hashlib.md5(f"{url}:{prompt}".encode()).hexdigest()

def scrape_with_cache(url, prompt):
    cache_key = get_cache_key(url, prompt)
    cache_path = f"cache/{cache_key}.json"  # assumes a cache/ directory exists

    # Check cache first
    try:
        with open(cache_path) as f:
            return json.load(f)
    except FileNotFoundError:
        pass

    # Scrape if not cached (build a fresh graph for this URL and prompt)
    scraper = SmartScraperGraph(prompt=prompt, source=url, config=graph_config)
    result = scraper.run()

    # Save to cache
    with open(cache_path, "w") as f:
        json.dump(result, f)

    return result

3. Pre-filter Content

Don't send the entire page to the LLM. Extract just what you need first:

import requests
from bs4 import BeautifulSoup

# Use traditional scraping to narrow down
url = "https://example-shop.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Get only the product section
product_section = soup.find('div', class_='products')

# Send only that section to the LLM (pass the HTML snippet as the source)
smart_scraper = SmartScraperGraph(
    prompt="Extract product data",
    source=str(product_section),
    config=graph_config
)
result = smart_scraper.run()

This dramatically reduces token usage.
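A rough back-of-envelope shows why, using the common ~4-characters-per-token rule of thumb (real tokenizer counts vary):

```python
# Estimate token savings from pre-filtering. The heuristic of ~4 chars
# per token is a rough approximation for English text, not a tokenizer.

def estimate_tokens(text: str) -> int:
    return len(text) // 4

full_page = "x" * 200_000     # a heavy page: ~200 KB of HTML
product_div = "x" * 8_000     # just the section you actually care about

print(estimate_tokens(full_page))    # ~50,000 tokens
print(estimate_tokens(product_div))  # ~2,000 tokens
```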

4. Batch Requests

urls = ["url1", "url2", "url3", ...]

# Process in batches to avoid overwhelming the API
batch_size = 5
for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]

    results = []
    for url in batch:
        result = scrape_with_cache(url, prompt)
        results.append(result)

    # Save batch results
    with open(f"batch_{i}.json", "w") as f:
        json.dump(results, f)

Issues I Ran Into

Hallucinations

Sometimes the AI makes up data that doesn't exist on the page.

Fix: Add validation:

def validate_extracted_data(extracted, original_html):
    # Keep only items whose name actually appears in the fetched HTML;
    # anything else is likely hallucinated.
    validated = []
    for item in extracted['products']:
        if item['name'] in original_html:
            validated.append(item)
        else:
            print(f"Warning: {item['name']} not found in page, dropping it")

    extracted['products'] = validated
    return extracted

Inconsistent Output

Same prompt, different results across runs.

Fix: Use structured output with JSON schema:

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
    },
    "output_format": {
        "type": "json_schema",
        "json_schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                            "in_stock": {"type": "boolean"}
                        },
                        "required": ["name", "price"]
                    }
                }
            }
        }
    }
}

Slow Performance

LLM API calls are slow, especially for large pages.

Fix: Use faster providers (Groq) or smaller models (GPT-3.5) when appropriate.

Rate Limiting

Still get blocked by target sites.

Fix: Same as traditional scraping - use proxies, add delays, rotate user agents. The AI doesn't magically bypass anti-scraping measures.
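A minimal sketch of that hygiene, with illustrative (not real-browser-accurate) User-Agent strings:

```python
import random
import time

# Rotate through a pool of User-Agent strings and sleep a randomized
# interval between requests. The UA values here are illustrative stand-ins.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def polite_headers() -> dict:
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(min_s: float = 1.0, max_s: float = 3.0) -> None:
    time.sleep(random.uniform(min_s, max_s))

# Usage sketch: requests.get(url, headers=polite_headers()); polite_delay()
```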

When to Use AI Scraping

After using it for a while, here's where it makes sense:

Good For:

  • Sites with inconsistent HTML
  • One-off scraping tasks
  • Prototyping and exploration
  • Complex data extraction patterns
  • When time > cost

Not Great For:

  • Large-scale scraping
  • Simple, predictable sites
  • Real-time extraction
  • Budget-constrained projects
  • When you need 100% accuracy

I use it alongside traditional scraping. AI for the complex/unknown sites, BeautifulSoup/Playwright for the straightforward ones.

Final Thoughts

Scrapegraph-ai is a powerful tool but not a replacement for traditional scraping. It's more like a specialized tool for specific situations.

The cost adds up faster than you'd expect. I burned through $50 in API credits before I realized I needed to be more strategic about when to use it.

That said, when it works, it's pretty impressive. Watching it extract data from a messy, inconsistent page without any manual selector work is satisfying.

Link to the project: github.com/VinciGit00/Scrapegraph-ai

Docs: scrapegraph-ai.readthedocs.io