Using DeepSeek AI for Data Parsing

Parsing unstructured text with an LLM instead of regex

The problem I was trying to solve

Part 1 got us raw HTML with unstructured movie info. Now I need to parse that info_text into structured data.

Why not regex?

Here's what I was dealing with:

导演: 克里斯托弗·诺兰 / 主演: 莱昂纳多·迪卡普里奥 / 2010年 / 美国 / 动作 / 科幻
(Director: Christopher Nolan / Starring: Leonardo DiCaprio / 2010 / USA / Action / Sci-Fi)

The problem is that the format varies:

  • Sometimes the director comes first, sometimes not
  • Actor names might be in English or Chinese
  • Some fields are missing entirely
  • Delimiters vary (slashes vs. commas vs. spaces)

With regex, I'd need dozens of patterns to cover all the cases. With an LLM, I can just tell it what to extract.
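To make that concrete, here's a sketch (the pattern is mine, not from the real parser) of how a single regex written for the example above falls over the moment a field is missing:

```python
import re

# A pattern written for the "导演: X / 主演: Y / ..." layout (hypothetical)
pattern = re.compile(r"导演:\s*(?P<director>[^/]+)\s*/\s*主演:\s*(?P<actors>[^/]+)")

ok = pattern.search("导演: 克里斯托弗·诺兰 / 主演: 莱昂纳多·迪卡普里奥 / 2010年 / 美国")
missing = pattern.search("主演: 莱昂纳多·迪卡普里奥 / 2010年 / 美国")  # no director field

print(ok is not None)       # True
print(missing is not None)  # False: one format change defeats the pattern
```

Every new layout variant means another pattern like this, which is exactly the maintenance burden I wanted to avoid.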

DeepSeek setup

DeepSeek's API is OpenAI-compatible, so I used the openai Python library:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com"
)

# Test
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Say hello"}]
)
print(response.choices[0].message.content)

It worked on the first try, and DeepSeek's free tier has been generous enough for my testing.

Prompt design

The prompt matters a lot. Here's what worked:

SYSTEM_PROMPT = """You are a data extraction assistant.
Extract movie information from text and return as valid JSON."""

USER_PROMPT = """Extract from this text:
{text}

Return ONLY JSON in this format:
{{
    "director": "string",
    "actors": ["actor1", "actor2"],
    "year": 2000,
    "country": "string",
    "genres": ["genre1", "genre2"]
}}

If a field is missing, use null or []."""

Key things that helped:

  • Specify JSON output explicitly
  • Define fallback values for missing fields
  • Use low temperature (0.1) for consistent output
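One gotcha worth calling out: because the template is filled in with str.format, any literal braces in the JSON example have to be doubled ({{ and }}), or Python raises a KeyError. A minimal sketch of the behavior:

```python
# Minimal sketch: literal JSON braces must be doubled when using str.format
TEMPLATE = """Extract from this text:
{text}

Return ONLY JSON in this format:
{{
    "director": "string",
    "year": 2000
}}"""

prompt = TEMPLATE.format(text="导演: 克里斯托弗·诺兰 / 2010年")
print("{" in prompt and "{{" not in prompt)  # doubled braces come out single
```

An f-string (as in the engine code below) has the same rule: double the braces you want to keep.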

The AI engine

import json
import os
from openai import OpenAI
from typing import Dict, Any, Optional

class AIEngine:
    def __init__(self):
        self.client = OpenAI(
            api_key=os.getenv("DEEPSEEK_API_KEY"),
            base_url="https://api.deepseek.com"
        )
        self.model = "deepseek-chat"

    def parse_movie_info(self, info_text: str) -> Optional[Dict[str, Any]]:
        """Parse unstructured movie text into structured JSON"""

        user_prompt = f"""Extract from this text:
{info_text}

Return ONLY JSON:
{{
    "director": "string",
    "actors": ["actor1", "actor2"],
    "year": 2000,
    "country": "string",
    "genres": ["genre1", "genre2"]
}}"""

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a data extraction assistant. Return valid JSON."},
                    {"role": "user", "content": user_prompt}
                ],
                temperature=0.1,
                response_format={"type": "json_object"},
                timeout=30
            )

            content = response.choices[0].message.content
            return json.loads(content)

        except Exception as e:
            print(f"Error parsing: {e}")
            return None

The response_format={"type": "json_object"} setting forces syntactically valid JSON output, which saved me from writing JSON repair logic. (It guarantees valid syntax, not the schema, so the fields still need checking.)

Handling edge cases

Real data is messy. Added validation:

def parse_movie_info(self, info_text: str) -> Optional[Dict[str, Any]]:
    # ... API call as above; `data` is the dict from json.loads(content) ...

    # Validate and clean response
    required_fields = ['director', 'actors', 'year', 'country', 'genres']
    result = {}

    for field in required_fields:
        if field in data:
            value = data[field]

            # Type conversions
            if field == 'actors' and not isinstance(value, list):
                value = [value] if value else []
            elif field == 'genres' and not isinstance(value, list):
                value = [value] if value else []
            elif field == 'year':
                try:
                    value = int(value)
                except (ValueError, TypeError):
                    value = None

            result[field] = value
        else:
            result[field] = None

    return result
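The same normalization rules can be exercised on their own. A minimal standalone version (the function name and sample dict are mine) run against a deliberately messy response:

```python
from typing import Any, Dict

def normalize_movie(data: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce an AI response dict into the expected field types."""
    result: Dict[str, Any] = {}
    for field in ('director', 'actors', 'year', 'country', 'genres'):
        value = data.get(field)
        if field in ('actors', 'genres') and not isinstance(value, list):
            value = [value] if value else []   # wrap a bare string, default to []
        elif field == 'year':
            try:
                value = int(value)             # "2010" -> 2010
            except (ValueError, TypeError):
                value = None
        result[field] = value
    return result

# A messy response: actors as a bare string, year as text, two fields missing
messy = {"director": "Christopher Nolan", "actors": "Leonardo DiCaprio", "year": "2010"}
print(normalize_movie(messy))
# {'director': 'Christopher Nolan', 'actors': ['Leonardo DiCaprio'], 'year': 2010,
#  'country': None, 'genres': []}
```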

Batch processing

Processing 250 movies one by one takes time. Added progress tracking:

def parse_movie_batch(self, movies: list) -> list:
    """Parse a batch of movies with progress tracking"""
    processed = []
    total = len(movies)

    for idx, movie in enumerate(movies, 1):
        print(f"Processing {idx}/{total}: {movie.get('title', 'Unknown')}")

        info_text = movie.get('info_text', '')
        parsed = self.parse_movie_info(info_text)

        if parsed:
            movie.update(parsed)
        else:
            # Set defaults if parsing failed
            movie.update({
                'director': None,
                'actors': [],
                'year': None,
                'country': None,
                'genres': []
            })

        processed.append(movie)

    return processed

Retry logic

APIs fail sometimes. Added retries with exponential backoff:

import time

from openai import APITimeoutError

def _call_api_with_retry(self, prompt: str, max_retries: int = 3):
    """Call API with retry logic"""

    for attempt in range(max_retries):
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[...],
                timeout=30
            )
            return response.choices[0].message.content

        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
                continue

        except Exception as e:
            print(f"Error: {e}")
            if attempt == max_retries - 1:
                return None

    return None
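The 2 ** attempt backoff is cheap to sanity-check: sleeps only happen between attempts, so with three retries the delays are one second, then two.

```python
# Backoff schedule for max_retries=3: sleep happens between attempts only
max_retries = 3
delays = [2 ** attempt for attempt in range(max_retries - 1)]
print(delays)       # [1, 2]
print(sum(delays))  # 3 seconds of backoff in the worst case, plus the timeouts
```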

Cost optimization

API calls add up. Things that helped:

  • Cache results: Added @lru_cache to avoid duplicate calls
  • DeepSeek: Cheaper than GPT-4 for this use case
  • Fallback: For simple cases, regex still works
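On the caching point: functools.lru_cache works here because the input is a plain string, which is hashable. A sketch with a stand-in for the real API call (the helper below is mine, and the counter exists only to show the cache hit):

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=1024)
def parse_cached(info_text: str) -> str:
    """Stand-in for the real API call; identical text never hits the API twice."""
    global calls
    calls += 1
    return f"parsed:{info_text}"

parse_cached("导演: 诺兰 / 2010年")
parse_cached("导演: 诺兰 / 2010年")  # cache hit, no second "API call"
print(calls)  # 1
```

One caveat: putting @lru_cache directly on a bound method keeps the instance alive in the cache, so a module-level helper like this is the safer shape.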

What I learned

  • DeepSeek works well for structured extraction
  • JSON mode is reliable; I got valid JSON every time
  • Low temperature (0.1) keeps output consistent
  • Caching is essential when processing batches
  • Rate limits happen faster than expected

One thing I haven't figured out is batch processing: DeepSeek doesn't support it yet. If anyone knows a workaround, let me know.