Zero-Shot Extractor: Data Extraction That Actually Understands Context
My goal was to extract structured data from documents without writing regex. Zero-Shot Extractor promised "just say what you want" extraction, but in practice it would miss fields, return the wrong data types, or fail on edge cases. Here's how I got reliable extraction out of it.
Problem
When I asked it to extract "contract amount and expiration date", the extractor would get the amount wrong or miss the date entirely. Accuracy was around 70%.
Actual Fix
I switched to few-shot prompting with examples and added schema validation. Providing 2-3 examples of correct extraction improved accuracy to 95%.
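To make the idea concrete, here is a minimal sketch of what few-shot prompting plus a basic field check can look like, using only the standard library. The names build_few_shot_prompt, check_fields, and REQUIRED_FIELDS are hypothetical helpers for illustration, not part of Zero-Shot Extractor, and the prompt wording is an assumption.

```python
import json

# Hypothetical field spec: name -> accepted Python type(s).
REQUIRED_FIELDS = {"amount": (int, float), "expiration_date": str}


def build_few_shot_prompt(instruction, examples, document):
    """Put worked examples ahead of the real document so the model sees the expected shape."""
    parts = [instruction.strip(), ""]
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Text: {example['text']}")
        parts.append(f"Output: {json.dumps(example['output'])}")
        parts.append("")
    parts.append("Now extract from this text. Reply with JSON only.")
    parts.append(f"Text: {document}")
    parts.append("Output:")
    return "\n".join(parts)


def check_fields(raw_reply):
    """Parse the model's JSON reply and list missing or mistyped fields."""
    data = json.loads(raw_reply)
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}: {type(data[name]).__name__}")
    return data, problems


examples = [{
    "text": "Contract between ABC Corp and XYZ Inc for $50,000 expiring Dec 31, 2024",
    "output": {"amount": 50000, "expiration_date": "2024-12-31"},
}]
prompt = build_few_shot_prompt(
    "Extract the contract amount and expiration date.",
    examples,
    "Your contract text here",
)
```

A non-empty problems list is a signal to retry or flag the document for review instead of trusting the output.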
Problem
Dates came out as "Jan 5, 2024", "01/05/2024", and "January 5th, 2024", all in different formats.
Actual Fix
I used Pydantic models with date validators and put explicit format instructions in the prompts. All dates are now normalized to ISO 8601.
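A date validator along these lines could do the normalization. This is a sketch assuming Pydantic v2 (field_validator). The model name ContractDates, the KNOWN_FORMATS list, and the reading of "01/05/2024" as US-style month/day/year are my assumptions for illustration.

```python
import re
from datetime import date, datetime

from pydantic import BaseModel, field_validator

# Formats seen in raw extractor output; slash dates are assumed to be month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%b %d, %Y", "%B %d, %Y", "%m/%d/%Y")
# Matches the ordinal suffix in "5th", "1st", "22nd", "3rd".
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


class ContractDates(BaseModel):
    expiration_date: date

    @field_validator("expiration_date", mode="before")
    @classmethod
    def parse_loose_date(cls, value):
        """Accept the date spellings listed above and return a real date object."""
        if isinstance(value, date):
            return value
        cleaned = ORDINAL_SUFFIX.sub("", str(value).strip())
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(cleaned, fmt).date()
            except ValueError:
                continue
        raise ValueError(f"unrecognized date format: {value!r}")


for raw in ("Jan 5, 2024", "01/05/2024", "January 5th, 2024"):
    parsed = ContractDates(expiration_date=raw)
    print(parsed.expiration_date.isoformat())  # 2024-01-05 each time
```

Storing the field as a date rather than a string means anything the validator cannot parse fails loudly instead of slipping through in a new format.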
Production Setup
from zero_shot_extractor import Extractor
from pydantic import BaseModel


# Target schema: the fields every contract extraction must return.
class ContractData(BaseModel):
    amount: float
    expiration_date: str  # ISO 8601, e.g. "2024-12-31"
    parties: list[str]


# A worked example shows the model the expected output shape.
extractor = Extractor(
    model="gpt-4o",
    examples=[{
        "text": "Contract between ABC Corp and XYZ Inc for $50,000 expiring Dec 31, 2024",
        "output": {"amount": 50000, "expiration_date": "2024-12-31", "parties": ["ABC Corp", "XYZ Inc"]}
    }]
)

result = extractor.extract("Your contract text here", schema=ContractData)