## The problem I was trying to solve

Part 1 got us raw HTML with unstructured movie info. Now we need to parse that `info_text` into structured data.
## Why not regex?

Here's the kind of text I was dealing with:

```text
导演: 克里斯托弗·诺兰 / 主演: 莱昂纳多·迪卡普里奥 / 2010年 / 美国 / 动作 / 科幻
```

(Director: Christopher Nolan / Starring: Leonardo DiCaprio / 2010 / USA / Action / Sci-Fi)
The problem is that the format varies:
- Sometimes director comes first, sometimes not
- Actor names might be in English or Chinese
- Some fields are missing entirely
- Delimiters vary (/ vs commas vs spaces)
With regex, I'd need dozens of patterns to cover all the cases. With AI, I just tell it what to extract.
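To make that concrete, here's a sketch of the regex approach and where it breaks. The pattern below is hypothetical (not from my scraper): it handles the canonical "director / starring / year" ordering, but silently fails on a line that drops the director or the 年 suffix.

```python
import re

# Hypothetical pattern for the "canonical" Douban line format.
# 导演 = director, 主演 = starring.
PATTERN = re.compile(
    r"导演:\s*(?P<director>[^/]+)\s*/\s*主演:\s*(?P<actors>[^/]+)"
    r"\s*/\s*(?P<year>\d{4})年"
)

canonical = "导演: 克里斯托弗·诺兰 / 主演: 莱昂纳多·迪卡普里奥 / 2010年 / 美国 / 动作 / 科幻"
variant = "主演: Tom Hanks / 1994 / 美国 / 剧情"  # no director, no 年 suffix

print(PATTERN.search(canonical) is not None)  # True: matches the canonical line
print(PATTERN.search(variant) is not None)    # False: fails on the variant
```

Every new variant means another pattern, which is how you end up with dozens of them.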
## DeepSeek setup

DeepSeek's API is compatible with OpenAI's, so I used the `openai` Python library:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com",
)

# Test call
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```

It worked on the first try, and DeepSeek's free tier has been generous enough for my testing.
## Prompt design

The prompt matters a lot. Here's what worked:

```python
SYSTEM_PROMPT = """You are a data extraction assistant.
Extract movie information from text and return it as valid JSON."""

# Literal braces are doubled so the template survives str.format()
USER_PROMPT = """Extract from this text:
{text}

Return ONLY JSON in this format:
{{
    "director": "string",
    "actors": ["actor1", "actor2"],
    "year": 2000,
    "country": "string",
    "genres": ["genre1", "genre2"]
}}
If a field is missing, use null or []."""
```

Key things that helped:

- Specify JSON output explicitly
- Define fallback values for missing fields
- Use a low temperature (0.1) for consistent output
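One gotcha worth calling out: the JSON schema inside the template contains literal braces, and `str.format()` treats single braces as placeholders. A small sketch (the template name is mine) showing the doubled braces surviving formatting:

```python
# Hypothetical template illustrating the brace-escaping needed for
# str.format(): the JSON example's braces are doubled ({{ and }})
# so they come out as literal { and } after formatting.
USER_PROMPT_TEMPLATE = """Extract from this text:
{text}

Return ONLY JSON in this format:
{{
    "director": "string",
    "actors": ["actor1", "actor2"],
    "year": 2000,
    "country": "string",
    "genres": ["genre1", "genre2"]
}}
If a field is missing, use null or []."""

prompt = USER_PROMPT_TEMPLATE.format(text="导演: 克里斯托弗·诺兰 / 2010年")
print("克里斯托弗·诺兰" in prompt)  # True: the movie text was slotted in
print('"director"' in prompt)       # True: the braces survived formatting
```

With single braces, `format()` raises instead of producing the prompt.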
## The AI engine

```python
import json
import os
from typing import Any, Dict, Optional

from openai import OpenAI


class AIEngine:
    def __init__(self):
        self.client = OpenAI(
            api_key=os.getenv("DEEPSEEK_API_KEY"),
            base_url="https://api.deepseek.com",
        )
        self.model = "deepseek-chat"

    def parse_movie_info(self, info_text: str) -> Optional[Dict[str, Any]]:
        """Parse unstructured movie text into structured JSON."""
        user_prompt = f"""Extract from this text:
{info_text}

Return ONLY JSON:
{{
    "director": "string",
    "actors": ["actor1", "actor2"],
    "year": 2000,
    "country": "string",
    "genres": ["genre1", "genre2"]
}}"""
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a data extraction assistant. Return valid JSON."},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=0.1,
                response_format={"type": "json_object"},
                timeout=30,
            )
            content = response.choices[0].message.content
            return json.loads(content)
        except Exception as e:
            print(f"Error parsing: {e}")
            return None
```
The `response_format={"type": "json_object"}` parameter forces JSON output. It saved me from writing JSON validation logic.
## Handling edge cases

Real data is messy, so I added validation:

```python
def parse_movie_info(self, info_text: str) -> Optional[Dict[str, Any]]:
    # ... API call as above; `data` holds the parsed JSON response ...

    # Validate and clean the response
    required_fields = ['director', 'actors', 'year', 'country', 'genres']
    result = {}
    for field in required_fields:
        if field in data:
            value = data[field]
            # Type conversions
            if field == 'actors' and not isinstance(value, list):
                value = [value] if value else []
            elif field == 'genres' and not isinstance(value, list):
                value = [value] if value else []
            elif field == 'year':
                try:
                    value = int(value)
                except (ValueError, TypeError):
                    value = None
            result[field] = value
        else:
            result[field] = None
    return result
```
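This cleanup logic is easy to get wrong, so it helps to pull it out into a free function you can test without hitting the API. A sketch (the name `normalize` is mine, not part of `AIEngine`):

```python
from typing import Any, Dict

def normalize(data: Dict[str, Any]) -> Dict[str, Any]:
    """Free-function version of the cleanup above, testable offline."""
    result = {}
    for field in ('director', 'actors', 'year', 'country', 'genres'):
        value = data.get(field)
        # Coerce scalars into lists for list-valued fields
        if field in ('actors', 'genres') and not isinstance(value, list):
            value = [value] if value else []
        # Coerce year to int, falling back to None on junk values
        elif field == 'year' and value is not None:
            try:
                value = int(value)
            except (ValueError, TypeError):
                value = None
        result[field] = value
    return result

print(normalize({'actors': 'Tom Hanks', 'year': '1994'}))
# {'director': None, 'actors': ['Tom Hanks'], 'year': 1994, 'country': None, 'genres': []}
```

Feeding it a few deliberately messy inputs (scalar actors, stringified years, missing fields) catches regressions early.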
## Batch processing

Processing 250 movies one by one takes time. I added progress tracking:

```python
def parse_movie_batch(self, movies: list) -> list:
    """Parse a batch of movies with progress tracking."""
    processed = []
    total = len(movies)
    for idx, movie in enumerate(movies, 1):
        print(f"Processing {idx}/{total}: {movie.get('title', 'Unknown')}")
        info_text = movie.get('info_text', '')
        parsed = self.parse_movie_info(info_text)
        if parsed:
            movie.update(parsed)
        else:
            # Set defaults if parsing failed
            movie.update({
                'director': None,
                'actors': [],
                'year': None,
                'country': None,
                'genres': [],
            })
        processed.append(movie)
    return processed
```
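Sequential calls are the bottleneck here. One client-side option is to fan the requests out over a small thread pool; the sketch below uses a stand-in parse function so it runs without an API key, and assumes your parser is safe to call concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_all(movies, parse_fn, max_workers=4):
    """Fan parse calls out over a thread pool, preserving input order.

    parse_fn stands in for AIEngine.parse_movie_info; with the real
    engine, each worker thread issues its own HTTP request.
    """
    texts = [m.get('info_text', '') for m in movies]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(parse_fn, texts))  # map() keeps order
    for movie, parsed in zip(movies, results):
        # Same defaults as the sequential version on failure
        movie.update(parsed or {'director': None, 'actors': [], 'year': None,
                                'country': None, 'genres': []})
    return movies

# Demo with a stand-in parser (no API needed)
movies = [{'title': 'Inception', 'info_text': '2010年 / 美国'},
          {'title': 'Unknown', 'info_text': ''}]
out = parse_all(movies, lambda t: {'year': 2010} if t else None)
print(out[0]['year'], out[1]['actors'])  # 2010 []
```

Keep `max_workers` small so you don't trip the API's rate limits, which bite sooner than you'd expect.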
## Retry logic

APIs fail sometimes. I added retries with exponential backoff:

```python
import time

from openai import APITimeoutError


def _call_api_with_retry(self, prompt: str, max_retries: int = 3):
    """Call the API with retry logic."""
    for attempt in range(max_retries):
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[...],  # system + user messages as before
                timeout=30,
            )
            return response.choices[0].message.content
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
                continue
        except Exception as e:
            print(f"Error: {e}")
            if attempt == max_retries - 1:
                return None
    return None
```
## Cost optimization

API calls add up. Things that helped:

- Cache results: added `@lru_cache` to avoid duplicate calls
- DeepSeek: cheaper than GPT-4 for this use case
- Fallback: for simple cases, regex still works
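The caching bullet can be sketched concretely. Since `info_text` is a plain (hashable) string, `functools.lru_cache` can memoize the parser directly; the `AIEngineCached` wrapper and `parse_fn` hook below are illustrative, not the engine's real API:

```python
from functools import lru_cache

class AIEngineCached:
    """Illustrative wrapper: memoize parses keyed on the raw info_text."""

    def __init__(self, parse_fn):
        # parse_fn stands in for the real API-backed parser. Wrapping it
        # per instance (rather than decorating a method) avoids lru_cache
        # pinning every engine instance in memory via `self`.
        self._parse = lru_cache(maxsize=512)(parse_fn)

    def parse_movie_info(self, info_text: str):
        # Note: cached results are shared objects; copy before mutating.
        return self._parse(info_text)

calls = []
def fake_parse(text):
    calls.append(text)
    return ('parsed', text)

engine = AIEngineCached(fake_parse)
engine.parse_movie_info("导演: 诺兰 / 2010年")
engine.parse_movie_info("导演: 诺兰 / 2010年")  # served from the cache
print(len(calls))  # 1: the underlying parser ran only once
```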
## What I learned
- DeepSeek works well for structured extraction
- JSON mode is reliable - got valid JSON every time
- Low temperature (0.1) keeps output consistent
- Caching is essential when processing batches
- Rate limits happen faster than expected
One thing I haven't figured out is batch processing: DeepSeek doesn't support batched API requests yet. If anyone knows a workaround, let me know.