What Is Scrapegraph-ai?
Scrapegraph-ai is a Python library that uses LLMs to scrape websites. Instead of writing CSS selectors or XPath queries, you tell the AI what data you want and it figures out how to extract it.
The idea is pretty cool: you give it a URL and a prompt like "extract all product names and prices", and it uses GPT-4 (or other models) to understand the page structure and pull out the relevant data.
I've been experimenting with it for a project where the target sites have inconsistent HTML structures. Traditional scraping would require writing separate parsers for each site. With Scrapegraph-ai, the AI handles the structure variations automatically.
How It Works
Under the hood, it uses a graph-based approach:
- Fetch the page: uses Playwright to render JavaScript
- Parse the HTML: converts it to a format the LLM can understand
- LLM extraction: sends the content to the LLM along with your prompt
- Structured output: returns JSON with the extracted data
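Conceptually, the steps above are a chain of stages that pass a shared state along. Here's a simplified sketch of that flow; the stage functions are illustrative stand-ins, not the library's actual node API:

```python
from typing import Callable

def run_pipeline(url: str, prompt: str, stages: list[Callable]) -> dict:
    """Pass a shared state dict through each stage in order."""
    state = {"url": url, "prompt": prompt}
    for stage in stages:
        state = stage(state)
    return state

# Toy stand-in stages
def fetch(state):
    # In the real library, Playwright renders the page here
    state["html"] = f"<html>content of {state['url']}</html>"
    return state

def parse(state):
    # Convert HTML into LLM-friendly text
    state["text"] = state["html"].replace("<html>", "").replace("</html>", "")
    return state

def extract(state):
    # The real library sends the parsed content plus your prompt to an LLM here
    state["answer"] = {"prompt": state["prompt"], "source": state["text"]}
    return state

result = run_pipeline("https://example.com", "extract titles", [fetch, parse, extract])
```

The graph abstraction just formalizes this: each node reads from and writes to the shared state, and edges define the order.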
Why Use It?
- No manual selectors: Don't need to inspect HTML and write CSS selectors
- Handles dynamic content: Works with JavaScript-rendered pages
- Flexible output: Can extract any data structure you describe
- Resilient to changes: If the site layout changes, the AI adapts
The Downsides
It's not magic. There are some real limitations:
- Cost: Each scrape uses LLM tokens, which adds up
- Speed: Slower than traditional scraping (LLM API calls take time)
- Reliability: Sometimes hallucinates data or misses things
- Rate limiting: Still subject to the same anti-scraping measures
Installation
Install with pip:
pip install scrapegraphai
Note the package name on PyPI is scrapegraphai, without the hyphen.
You'll also need an OpenAI API key (or whatever LLM you're using):
export OPENAI_API_KEY="your-key-here"
Or set it in your Python code:
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"
It uses Playwright under the hood, so you might need to install browsers:
playwright install chromium
Basic Usage
SmartScraperGraph
The simplest way to get started:
import os

from scrapegraphai.graphs import SmartScraperGraph

# Define what you want to extract
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": True,
}
# Create the graph
smart_scraper = SmartScraperGraph(
    prompt="List all product names, prices, and descriptions",
    source="https://example-shop.com/products",
    config=graph_config
)
# Run it
result = smart_scraper.run()
print(result)
The output will be JSON like:
{
  "products": [
    {
      "name": "Product A",
      "price": "$29.99",
      "description": "A great product"
    },
    {
      "name": "Product B",
      "price": "$49.99",
      "description": "Even better"
    }
  ]
}
Pretty clean. The AI figured out the structure on its own.
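And since the result is a plain Python dict, you can post-process it directly. For instance, using the example output above:

```python
# `result` mirrors the example output shown above
result = {
    "products": [
        {"name": "Product A", "price": "$29.99", "description": "A great product"},
        {"name": "Product B", "price": "$49.99", "description": "Even better"},
    ]
}

# Convert price strings to floats and find the cheapest product
for p in result["products"]:
    p["price_value"] = float(p["price"].lstrip("$"))

cheapest = min(result["products"], key=lambda p: p["price_value"])
print(cheapest["name"])  # Product A
```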
SearchGraph
If you need to search the web first:
from scrapegraphai.graphs import SearchGraph

search_graph = SearchGraph(
    prompt="Extract the top 5 results for 'python tutorials'",
    config=graph_config
)

result = search_graph.run()
print(result)
This runs a web search first, then scrapes the results.
DeepScraperGraph
For multi-page scraping (follows links):
from scrapegraphai.graphs import DeepScraperGraph
deep_scraper = DeepScraperGraph(
    prompt="Extract all blog post titles and links from this page and linked pages",
    source="https://example.com/blog",
    config=graph_config,
    depth=2  # How many levels deep to follow
)
result = deep_scraper.run()
print(result)
Be careful with depth - it can get expensive fast with LLM calls.
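A rough back-of-envelope shows why: if each page links to b new pages, a crawl of depth d visits on the order of 1 + b + b² + … + b^d pages, each one an LLM call. A quick sketch:

```python
def estimated_pages(branching: int, depth: int) -> int:
    """Upper bound on pages visited: 1 + b + b^2 + ... + b^depth."""
    return sum(branching ** level for level in range(depth + 1))

# With 10 links per page, depth 2 already means up to 111 LLM-processed pages
print(estimated_pages(10, 2))  # 111
```

Going from depth 2 to depth 3 in this example would mean up to 1,111 pages, so an extra level can be a 10x cost jump.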
Building Custom Graphs
The real power is creating custom graphs for specific workflows:
from scrapegraphai.nodes import (
    FetchNode,
    ParseNode,
    GenerateAnswerNode
)
from scrapegraphai.graphs import BaseGraph
# Define custom nodes
fetch_node = FetchNode(
    node_name="Fetch",
    node_params={
        "loader_kwargs": {
            "headless": True
        }
    }
)

parse_node = ParseNode(
    node_name="Parse",
    node_params={}
)

gen_answer_node = GenerateAnswerNode(
    node_name="GenerateAnswer",
    node_params={
        "llm_config": {
            "api_key": os.getenv("OPENAI_API_KEY"),
            "model": "openai/gpt-4o"
        }
    }
)

# Connect nodes
custom_graph = BaseGraph(
    nodes=[
        fetch_node,
        parse_node,
        gen_answer_node
    ],
    edges=[
        ("Fetch", "Parse"),
        ("Parse", "GenerateAnswer")
    ]
)

result = custom_graph.run({
    "url": "https://example.com",
    "user_prompt": "Extract all article titles and dates"
})
print(result)
This lets you add custom logic between fetching and extracting. Useful for things like filtering, validation, or post-processing.
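As a concrete example, a post-processing step between extraction and output might dedupe and normalize the results. This is a generic sketch; the function name and data shape are my own, not part of the library:

```python
def postprocess(items: list[dict]) -> list[dict]:
    """Dedupe by name (case-insensitive) and strip stray whitespace."""
    seen = set()
    cleaned = []
    for item in items:
        name = item.get("name", "").strip()
        key = name.lower()
        if not name or key in seen:
            continue  # drop empty names and duplicates
        seen.add(key)
        cleaned.append({**item, "name": name})
    return cleaned

raw = [{"name": " Widget "}, {"name": "widget"}, {"name": "Gadget"}]
print(postprocess(raw))  # [{'name': 'Widget'}, {'name': 'Gadget'}]
```

A step like this would slot in as a node between "GenerateAnswer" and the final output, or just run on the returned dict.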
Using Different LLMs
You're not limited to OpenAI; it works with other providers too:
Ollama (Local Models)
graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": True,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract product data",
    source="https://example.com",
    config=graph_config
)
This is great for cost savings. No API fees, just run it locally. The downside is smaller models might not extract as accurately.
Groq
graph_config = {
    "llm": {
        "api_key": os.getenv("GROQ_API_KEY"),
        "model": "groq/llama3-70b-8192",
    },
    "verbose": True,
    "headless": True,
}
Groq's hosted models are fast and inexpensive compared to OpenAI's. Good for experimentation.
Azure OpenAI
graph_config = {
    "llm": {
        "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
        "model": "azure/openai/gpt-4",
        "api_base": "https://your-resource.openai.azure.com",
    },
    "verbose": True,
    "headless": True,
}
Cost Optimization
This is the big one. LLM scraping gets expensive fast. Here's what I learned:
1. Use Cheaper Models for Simple Tasks
# For simple extraction
# For simple extraction
graph_config = {
    "llm": {
        "model": "openai/gpt-3.5-turbo",  # Cheaper than GPT-4
    }
}

# For complex extraction
graph_config = {
    "llm": {
        "model": "openai/gpt-4o",  # More accurate
    }
}
2. Cache Results
import hashlib
import json
import os

def get_cache_key(url, prompt):
    return hashlib.md5(f"{url}:{prompt}".encode()).hexdigest()

def scrape_with_cache(url, prompt):
    cache_key = get_cache_key(url, prompt)
    # Check cache first
    try:
        with open(f"cache/{cache_key}.json") as f:
            return json.load(f)
    except FileNotFoundError:
        pass
    # Scrape if not cached
    scraper = SmartScraperGraph(prompt=prompt, source=url, config=graph_config)
    result = scraper.run()
    # Save to cache
    os.makedirs("cache", exist_ok=True)
    with open(f"cache/{cache_key}.json", "w") as f:
        json.dump(result, f)
    return result
3. Pre-filter Content
Don't send the entire page to the LLM. Extract just what you need first:
import requests
from bs4 import BeautifulSoup

# Use traditional scraping to narrow down
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Get only the product section
product_section = soup.find('div', class_='products')

# Send only that section to the LLM (the source can be raw HTML instead of a URL)
smart_scraper = SmartScraperGraph(
    prompt="Extract product data",
    source=str(product_section),
    config=graph_config
)
result = smart_scraper.run()
This dramatically reduces token usage.
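You can sanity-check the savings with a rough token estimate, using the common heuristic of ~4 characters per token for English text (exact counts vary by tokenizer):

```python
def rough_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

full_page = "x" * 200_000       # a heavy HTML page, ~200 KB
product_section = "x" * 8_000   # just the product <div> subtree

print(rough_tokens(full_page))       # 50000
print(rough_tokens(product_section)) # 2000
```

In this (made-up) example, pre-filtering cuts the input from ~50k tokens to ~2k per scrape, a 25x reduction on the input side of the bill.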
4. Batch Requests
urls = ["url1", "url2", "url3", ...]

# Process in batches to avoid overwhelming the API
batch_size = 5
for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    results = []
    for url in batch:
        result = scrape_with_cache(url, prompt)
        results.append(result)
    # Save batch results
    with open(f"batch_{i}.json", "w") as f:
        json.dump(results, f)
Issues I Ran Into
Hallucinations
Sometimes the AI makes up data that doesn't exist on the page.
Fix: Add validation:
def validate_extracted_data(extracted, original_html):
    # Check if extracted values actually exist in the HTML
    for item in extracted['products']:
        if item['name'] not in original_html:
            print(f"Warning: {item['name']} not found in page")
            # Remove or flag the hallucinated item
    return extracted
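Exact substring checks are brittle, though: whitespace differences, HTML entities, and markup split across text nodes all cause false alarms. Normalizing both sides before comparing catches more real matches. A sketch:

```python
import re
from html import unescape

def normalize(text: str) -> str:
    """Lowercase, decode HTML entities, collapse whitespace."""
    return re.sub(r"\s+", " ", unescape(text)).strip().lower()

def is_grounded(value: str, html: str) -> bool:
    """True if the extracted value plausibly appears in the source HTML."""
    return normalize(value) in normalize(html)

html = "<p>Caf&eacute;   Deluxe &ndash; $9</p>"
print(is_grounded("café deluxe", html))   # True
print(is_grounded("Espresso Bar", html))  # False
```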
Inconsistent Output
Same prompt, different results across runs.
Fix: Pin the output to a schema. SmartScraperGraph accepts a Pydantic model via its schema parameter, and setting the LLM temperature to 0 also helps:
from typing import List
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: str
    in_stock: bool = False  # optional; name and price are required

class Products(BaseModel):
    products: List[Product]

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "temperature": 0,
    },
}

smart_scraper = SmartScraperGraph(
    prompt="List all products with name, price, and stock status",
    source="https://example-shop.com/products",
    config=graph_config,
    schema=Products
)
Slow Performance
LLM API calls are slow, especially for large pages.
Fix: Use faster providers (Groq) or smaller models (GPT-3.5) when appropriate.
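Since most of the wall-clock time is spent waiting on API responses, you can also overlap the waiting with a thread pool. A sketch, where scrape_one is a stand-in for whatever scrape call you use (keep max_workers low enough to respect your provider's rate limits):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_one(url: str) -> dict:
    # Stand-in for a real SmartScraperGraph(...).run() call
    return {"url": url, "data": "..."}

def scrape_many(urls: list[str], max_workers: int = 4) -> list[dict]:
    """Run scrapes concurrently; LLM calls are I/O-bound, so threads help."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in the results
        return list(pool.map(scrape_one, urls))

results = scrape_many(["https://a.example", "https://b.example"])
```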
Rate Limiting
Still get blocked by target sites.
Fix: Same as traditional scraping - use proxies, add delays, rotate user agents. The AI doesn't magically bypass anti-scraping measures.
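The basics carry over directly. For example, jittered delays and rotating User-Agent headers (a sketch; the header strings are illustrative, use current realistic ones in practice):

```python
import random
import time
from itertools import cycle

# Illustrative User-Agent strings; rotate through real, current ones in practice
USER_AGENTS = cycle([
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
])

def build_headers() -> dict:
    """Use a different User-Agent on each request."""
    return {"User-Agent": next(USER_AGENTS)}

def polite_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Sleep a randomized interval so requests don't look machine-timed."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

headers = build_headers()
```

Proxies work the same way as with traditional scrapers; the LLM layer only changes how the fetched page is parsed, not how it's fetched.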
When to Use AI Scraping
After using it for a while, here's where it makes sense:
Good For:
- Sites with inconsistent HTML
- One-off scraping tasks
- Prototyping and exploration
- Complex data extraction patterns
- When time > cost
Not Great For:
- Large-scale scraping
- Simple, predictable sites
- Real-time extraction
- Budget-constrained projects
- When you need 100% accuracy
I use it alongside traditional scraping. AI for the complex/unknown sites, BeautifulSoup/Playwright for the straightforward ones.
Final Thoughts
Scrapegraph-ai is a powerful tool but not a replacement for traditional scraping. It's more like a specialized tool for specific situations.
The cost adds up faster than you'd expect. I burned through $50 in API credits before I realized I needed to be more strategic about when to use it.
That said, when it works, it's pretty impressive. Watching it extract data from a messy, inconsistent page without any manual selector work is satisfying.
Link to the project: github.com/VinciGit00/Scrapegraph-ai