How it started
Been working on a web scraper project lately. Wanted to crawl Douban Top 250 movies and extract structured data. The twist is using DeepSeek AI to parse the unstructured movie info instead of regex.
Traditional parsing with regex is fragile - HTML structure changes and everything breaks. LLMs handle format variations much better.
The full project is on GitHub: github.com/stars1324/python-ai-spider
The idea
A movie data scraper that:
- Fetches 250 movies from Douban Top 250
- Uses AI (DeepSeek) to parse unstructured info
- Stores data in SQLite
- Generates charts and statistics
Project structure
python-ai-spider/
├── core/
│   ├── spider.py        # Web scraping logic
│   ├── ai_engine.py     # AI parsing engine
│   └── database.py      # Database operations
├── utils/
│   ├── config.py        # Configuration
│   └── logger.py        # Logging
├── analysis/
│   └── charts.py        # Data visualization
├── data/                # SQLite database
├── logs/                # Log files
├── main.py              # Entry point
└── requirements.txt
Environment setup
Python 3.10+ is required. Set up a virtual environment:
# Clone the repo
git clone https://github.com/stars1324/python-ai-spider.git
cd python-ai-spider
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Key dependencies:
- httpx - HTTP client with async support
- beautifulsoup4 - HTML parsing
- openai - LLM API client (compatible with DeepSeek)
- pandas - Data analysis
- matplotlib - Visualization
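Based on that list, the requirements.txt is roughly the following (unpinned here; the repo may pin exact versions):

```
httpx
beautifulsoup4
openai
pandas
matplotlib
```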
API setup
Need a DeepSeek API key. Get one from platform.deepseek.com. Free tier is enough for this project.
Set it as environment variable:
# Linux/Mac
export DEEPSEEK_API_KEY="your-api-key-here"
# Windows (PowerShell)
$env:DEEPSEEK_API_KEY="your-api-key-here"
# Or create .env file:
echo "DEEPSEEK_API_KEY=your-api-key-here" > .env
Don't commit API keys to git. Add .env to .gitignore.
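To make a missing key fail loudly at startup instead of mid-crawl, a small helper can read the variable up front. This is a minimal sketch, not code from the repo; the commented client line follows DeepSeek's OpenAI-compatible endpoint:

```python
import os

def load_api_key(env_var: str = "DEEPSEEK_API_KEY") -> str:
    """Read the API key from the environment and fail fast if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or put it in .env")
    return key

# The openai client talks to DeepSeek's OpenAI-compatible API like this:
# client = OpenAI(api_key=load_api_key(), base_url="https://api.deepseek.com")
```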
Analyzing the page
Open https://movie.douban.com/top250 in a browser and use DevTools (F12) to inspect the HTML structure.
Each movie item has an info block with director/actors/year/country/genres. The format varies across movies - some fields may be missing or appear in a different order. That's where AI helps.
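For reference, here is roughly what one raw info block looks like and the structured record the AI parser should produce. The sample text is illustrative (modeled on a typical Top 250 entry), not copied from the live page:

```python
# Illustrative raw info_text for one movie (format varies across entries)
raw_info = (
    "导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins\n"
    "1994 / 美国 / 犯罪 剧情"
)

# Target schema the AI parser should return for every movie
parsed = {
    "director": "弗兰克·德拉邦特",
    "actors": ["蒂姆·罗宾斯"],
    "year": 1994,
    "country": "美国",
    "genres": ["犯罪", "剧情"],
}
```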
Basic spider
import httpx
from bs4 import BeautifulSoup

def fetch_page(page_num=1):
    """Fetch a single page from Douban Top 250"""
    base_url = "https://movie.douban.com/top250"
    start = (page_num - 1) * 25
    url = f"{base_url}?start={start}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    }
    response = httpx.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    return None

def parse_movies(html):
    """Extract movie data from HTML"""
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_='item')
    movies = []
    for idx, item in enumerate(items, 1):  # rank within this page only
        movie = {
            'rank': idx,
            'title': item.find('span', class_='title').text.strip(),
            'rating': float(item.find('span', class_='rating_num').text),
            # Douban's info <p> has an empty class attribute, hence class_=''
            'info_text': item.find('div', class_='bd').find('p', class_='').text.strip()
        }
        movies.append(movie)
    return movies
Next part covers using AI to parse the info_text into structured JSON.
Anti-scraping
Douban blocks aggressive scraping, so I added:
- Random delays between requests (1-3 seconds)
- Rotating User-Agent headers
- Error handling and retry logic
- Rate limiting
import random
import time

# Random delay between requests
time.sleep(random.uniform(1, 3))

# Random User-Agent per request
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/120.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Firefox/121.0"
]
headers = {'User-Agent': random.choice(USER_AGENTS)}
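The retry logic mentioned above can be a small wrapper with exponential backoff. A sketch, not the repo's implementation; max_tries and base_delay are my own knobs:

```python
import time

def with_retries(func, max_tries=3, base_delay=2.0):
    """Call func(), retrying on failure with exponential backoff."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_tries:
                raise                         # out of tries, surface the error
            wait = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}), retrying in {wait}s")
            time.sleep(wait)

# Usage: html = with_retries(lambda: fetch_page(3))
```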
Issues I ran into
403 Forbidden
Happened when scraping too fast. Fixed by:
- Increasing delay between requests
- Checking if IP is blocked
- Trying again later
Chinese characters not displaying
In charts mostly. Fixed by configuring matplotlib fonts:
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
HTML structure changed
Selectors kept breaking as the page markup shifted. Fixed by using more generic selectors and role attributes when possible.
What's next
Next part covers setting up DeepSeek API and building the AI parsing engine.