Zero-Shot Extractor: Data Extraction That Actually Understands Context

Extracting structured data from documents without writing regex was my goal. Zero-Shot Extractor promised "just say what you want" extraction, but it would miss fields, extract wrong data types, or fail on edge cases. Here's how I got reliable zero-shot extraction.

Problem

Asked to extract "contract amount and expiration date", but extractor would get the amount wrong or miss the date entirely. Accuracy around 70%.

Actual Fix

Used few-shot prompting with examples and schema validation. Providing 2-3 examples of correct extraction improved accuracy to 95%.

Problem

Dates came out as "Jan 5, 2024", "01/05/2024", "January 5th, 2024" - all different formats.

Actual Fix

Pydantic models with date validators and explicit format instructions in prompts. All dates now normalized to ISO 8601.

Production Setup

from zero_shot_extractor import Extractor
from pydantic import BaseModel

class ContractData(BaseModel):
    amount: float
    expiration_date: str
    parties: list[str]

extractor = Extractor(
    model="gpt-4o",
    examples=[{
        "text": "Contract between ABC Corp and XYZ Inc for $50,000 expiring Dec 31, 2024",
        "output": {"amount": 50000, "expiration_date": "2024-12-31", "parties": ["ABC Corp", "XYZ Inc"]}
    }]
)

result = extractor.extract("Your contract text here", schema=ContractData)

Related Resources