Building an AI-Powered Web Scraper

Using DeepSeek to parse unstructured data from web pages

How it started

Been working on a web scraper project lately. Wanted to crawl Douban Top 250 movies and extract structured data. The twist is using DeepSeek AI to parse the unstructured movie info instead of regex.

Traditional parsing with regex is fragile - HTML structure changes and everything breaks. LLMs handle format variations much better.

The full project is on GitHub: github.com/stars1324/python-ai-spider

The idea

A movie data scraper that:

  • Fetches 250 movies from Douban Top 250
  • Uses AI (DeepSeek) to parse unstructured info
  • Stores data in SQLite
  • Generates charts and statistics

Project structure

python-ai-spider/
├── core/
│   ├── spider.py          # Web scraping logic
│   ├── ai_engine.py       # AI parsing engine
│   └── database.py        # Database operations
├── utils/
│   ├── config.py          # Configuration
│   └── logger.py          # Logging
├── analysis/
│   └── charts.py          # Data visualization
├── data/                  # SQLite database
├── logs/                  # Log files
├── main.py                # Entry point
└── requirements.txt

Environment setup

Python 3.10+ is required. Set up a virtual environment:

# Clone the repo
git clone https://github.com/stars1324/python-ai-spider.git
cd python-ai-spider

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Key dependencies:

  • httpx - HTTP client with async support
  • beautifulsoup4 - HTML parsing
  • openai - LLM API client (compatible with DeepSeek)
  • pandas - Data analysis
  • matplotlib - Visualization

API setup

Need a DeepSeek API key. Get one from platform.deepseek.com. Free tier is enough for this project.

Set it as an environment variable:

# Linux/Mac
export DEEPSEEK_API_KEY="your-api-key-here"

# Windows (PowerShell)
$env:DEEPSEEK_API_KEY="your-api-key-here"

# Or create .env file:
echo "DEEPSEEK_API_KEY=your-api-key-here" > .env

Don't commit API keys to git. Add .env to .gitignore.
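Reading the key back in Python can be as simple as this (`load_api_key` is a hypothetical helper name - the repo's utils/config.py presumably does something along these lines):

```python
import os

def load_api_key():
    """Read the DeepSeek API key from the environment, failing fast if absent."""
    key = os.environ.get("DEEPSEEK_API_KEY")
    if not key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set - export it or add it to .env")
    return key
```

Failing fast here beats getting a cryptic 401 from the API ten minutes into a scrape.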

Analyzing the page

Open https://movie.douban.com/top250 in browser. Use DevTools (F12) to inspect HTML structure.

Each movie item has an info block with director/actors/year/country/genres. The format varies across movies - some fields may be missing or appear in a different order. That's where AI helps.
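To see the fragility concretely, here's a toy regex against two made-up info blocks (the field contents are illustrative, not actual scraped data):

```python
import re

# Two info blocks; the second has no 主演 (actors) field at all.
info_a = "导演: 弗兰克·德拉邦特   主演: 蒂姆·罗宾斯\n1994 / 美国 / 犯罪 剧情"
info_b = "导演: 宫崎骏\n2001 / 日本 / 剧情 动画 奇幻"

# A pattern that assumes both fields are always present...
pattern = re.compile(r"导演: (?P<director>.+?)\s+主演: (?P<actors>.+)")

print(pattern.search(info_a) is not None)  # True  - the common case works
print(pattern.search(info_b) is not None)  # False - missing field, silent failure
```

The regex doesn't error on the second block, it just silently returns nothing - exactly the failure mode an LLM-based parser avoids.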

Basic spider

import httpx
from bs4 import BeautifulSoup

def fetch_page(page_num=1):
    """Fetch a single page from Douban Top 250"""
    base_url = "https://movie.douban.com/top250"
    start = (page_num - 1) * 25
    url = f"{base_url}?start={start}"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    }

    response = httpx.get(url, headers=headers, timeout=10)

    if response.status_code == 200:
        return response.text
    return None

def parse_movies(html, start_rank=1):
    """Extract movie data from one page of HTML.

    start_rank is the rank of the first item on the page (26 for page 2,
    and so on) - plain enumerate would restart at 1 on every page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_='item')

    movies = []
    for rank, item in enumerate(items, start_rank):
        movie = {
            'rank': rank,
            'title': item.find('span', class_='title').text.strip(),
            'rating': float(item.find('span', class_='rating_num').text),
            # the info paragraph is the first <p> inside the .bd block
            'info_text': item.find('div', class_='bd').find('p').text.strip()
        }
        movies.append(movie)

    return movies
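Driving the single-page fetcher across all 10 pages looks something like this (`fetch_all_pages` is my sketch, not code from the repo; the fetcher is passed in as a parameter so the loop can be exercised offline):

```python
import random
import time

def fetch_all_pages(fetcher, pages=10, delay=(1, 3)):
    """Fetch every page with a polite random delay between requests.

    fetcher is any callable like fetch_page(page_num) -> html-or-None,
    injected so the loop can be tested without hitting Douban.
    """
    results = []
    for page in range(1, pages + 1):
        html = fetcher(page)
        if html is not None:
            results.append(html)
        if page < pages:  # no point sleeping after the final page
            time.sleep(random.uniform(*delay))
    return results
```

With 10 pages of 25 items, the `?start=` offsets this walks are 0, 25, 50, ... 225.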

Next part covers using AI to parse the info_text into structured JSON.

Anti-scraping

Douban will block aggressive requests. Added:

  • Random delays between requests (1-3 seconds)
  • Rotating User-Agent headers
  • Error handling and retry logic
  • Rate limiting

import random
import time

# Random delay
time.sleep(random.uniform(1, 3))

# Random user agent
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/120.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Firefox/121.0"
]
headers = {'User-Agent': random.choice(USER_AGENTS)}
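The retry logic from the list above can be sketched like this (`fetch_with_retry` is illustrative, not the repo's implementation):

```python
import time

def fetch_with_retry(fetcher, url, retries=3, backoff=2.0):
    """Call fetcher(url), retrying on any exception with a growing delay."""
    for attempt in range(retries):
        try:
            return fetcher(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts - let the caller see the error
            time.sleep(backoff * (attempt + 1))  # waits backoff, 2*backoff, ...
```

Linear backoff is enough here since the per-request delay already keeps the rate low; exponential backoff is the usual upgrade if the site keeps pushing back.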

Issues I ran into

403 Forbidden

Happened when scraping too fast. Fixed by:

  • Increasing delay between requests
  • Checking if IP is blocked
  • Trying again later

Chinese characters not displaying

In charts mostly. Fixed by configuring matplotlib fonts:

import matplotlib

matplotlib.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']  # fonts with CJK glyphs
matplotlib.rcParams['axes.unicode_minus'] = False  # keep minus signs from rendering as boxes

HTML structure changed

Selectors broke when the markup changed. Fixed by switching to more generic selectors and role attributes where possible.

What's next

Next part covers setting up DeepSeek API and building the AI parsing engine.