OmniParser: UI Element Extraction That Actually Works for Agents
I was building AI agents that needed to understand and interact with web interfaces. The core problem was converting raw HTML into something agents could reason about - buttons, inputs, navigation, and so on. OmniParser promised to extract UI elements into semantic structures, but out of the box it missed elements, misclassified them, or produced output agents couldn't use effectively. Here's how I got element extraction reliable enough for agents to actually work with.
Problem
OmniParser would fail to detect buttons and inputs that were inside shadow DOM or web components. Modern SPAs use these extensively, so 20-30% of interactive elements were being missed.
Detection rate: 72% (target: > 95%)
What I Tried
Attempt 1: Used accessibility tree instead of DOM. This found shadow DOM elements but lacked spatial information.
Attempt 2: Pre-rendered JavaScript with Playwright. This worked but added 10+ seconds per page.
Attempt 3: Custom shadow DOM traversal. This was fragile and broke on different component libraries.
Actual Fix
Combined accessibility tree scanning with visual element detection. OmniParser now uses the accessibility tree for shadow DOM elements and supplements with vision-based detection for anything missed. Also added component library-specific heuristics.
# OmniParser with shadow DOM support
from omniparser import OmniParser
from omniparser.detectors import AccessibilityDetector, VisionDetector

parser = OmniParser(
    # Multi-modal detection
    detectors=[
        # Primary: accessibility tree (catches shadow DOM)
        AccessibilityDetector(
            include_shadow_dom=True,
            include_web_components=True,
            get_aria_labels=True,
            get_semantic_roles=True
        ),
        # Secondary: vision-based (catches visually detectable elements)
        VisionDetector(
            model="omniparser-vision",
            confidence_threshold=0.7,
            detect_without_labels=True  # Find elements without text
        )
    ],
    # Merge strategy
    merge_strategy="union",                   # Combine all detections
    resolve_conflicts="accessibility_first",  # Trust accessibility tree
    # Component library heuristics
    detect_known_components=True,
    component_libraries=["react", "vue", "angular", "svelte"],
    # Spatial reasoning
    include_bounding_boxes=True,
    include_z_index=True,
    include_visibility=True
)

# Result:
# - Accessibility tree finds ~85% of elements (including shadow DOM)
# - Vision detector catches the remaining ~10%
# - Combined: 95%+ detection rate
Problem
Elements were labeled with generic types like "button" or "input" without context. Agents couldn't understand that a button was for "submitting the form" or an input was for "email address".
What I Tried
Attempt 1: Used surrounding text to infer purpose. This produced noisy labels.
Attempt 2: Prompted LLM to analyze each element individually. This was too slow for large pages.
Actual Fix
Implemented contextual semantic labeling with relationship extraction. Elements now get labels based on their purpose (not just type), their relationships to other elements (form, fieldset, etc.), and the page context.
# Contextual semantic labeling
from omniparser import OmniParser
from omniparser.semantic import ContextualLabeler

parser = OmniParser(
    # Semantic labeling
    labeler=ContextualLabeler(
        # Context sources
        use_form_context=True,    # Understand form relationships
        use_aria_context=True,    # Use ARIA labels and descriptions
        use_text_context=True,    # Use surrounding text
        use_visual_context=True,  # Use visual hierarchy
        # Label granularity
        label_type="functional",  # What it does, not just what it is
        include_purpose=True,     # "submit_button", not just "button"
        include_domain=True,      # Domain-specific labels
        # Relationship extraction
        extract_relationships=True,
        relationships=[
            "form_field",         # Input belongs to form
            "label_for",          # Label describes input
            "button_action",      # Button triggers action
            "navigation_target"   # Link destination
        ]
    )
)

# Output now includes:
# {
#   "type": "button",
#   "purpose": "submit_form",
#   "label": "Submit registration",
#   "form": "registration_form",
#   "relationships": {
#     "submits": "registration_form"
#   }
# }
# Agents can now reason: "Click this to submit the registration form"
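To make the payoff concrete, here is a minimal sketch of how an agent-side helper could turn that labeled output into a natural-language action hint. The element dict mirrors the output shape above; `describe_action` is an illustrative helper of my own, not part of the OmniParser API.

```python
def describe_action(element: dict) -> str:
    """Build a one-line action hint an agent can reason over."""
    purpose = element.get("purpose", element["type"])
    label = element.get("label", "")
    rels = element.get("relationships", {})
    # Relationship data is what makes the hint specific, not generic
    if "submits" in rels:
        return f'Click "{label}" to submit {rels["submits"]}'
    if element["type"] == "input":
        return f'Type into "{label}" ({purpose})'
    return f'Interact with "{label}" ({purpose})'

element = {
    "type": "button",
    "purpose": "submit_form",
    "label": "Submit registration",
    "form": "registration_form",
    "relationships": {"submits": "registration_form"},
}
print(describe_action(element))
# → Click "Submit registration" to submit registration_form
```

With only the generic `"type": "button"`, the best the helper could say is "click a button"; the purpose and relationship fields are what let it say *why*.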
Problem
OmniParser's output format didn't match what LangChain, AutoGen, or other agent frameworks expected. Converting the output required custom code for each framework.
What I Tried
Attempt 1: Wrote custom converters for each framework. This was maintenance-heavy.
Attempt 2: Modified agent code to accept OmniParser format. This coupled agents to OmniParser.
Actual Fix
Created standard output adapters for popular agent frameworks. OmniParser now outputs in native formats for LangChain Tools, AutoGen Functions, CrewAI Tools, and standard JSON Schema that works with any framework.
# Agent framework integration
from omniparser import OmniParser
from omniparser.adapters import LangChainAdapter, AutoGenAdapter

# Parse page
parser = OmniParser()
elements = parser.parse_url("https://example.com/form")

# Convert to LangChain Tool format
langchain_tools = LangChainAdapter.to_tools(elements)

# Now directly usable in LangChain agents
for tool in langchain_tools:
    agent.add_tool(tool)

# Or convert to AutoGen functions, usable in AutoGen agent conversations
autogen_funcs = AutoGenAdapter.to_functions(elements)

# Or get standard JSON Schema that works with any framework
json_schema = elements.to_json_schema()

# The adapters handle:
# - Type conversion
# - Action mapping (click, type, select)
# - Parameter mapping (coordinates, text, etc.)
# - Description generation
# - Schema validation
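For a framework without a ready-made adapter, the core mapping is small. Here is a hedged sketch of the idea: turn a parsed element into an OpenAI-function-style tool spec. The element shape and the `to_tool_spec` helper are my own illustration, not the shipped adapter code.

```python
def to_tool_spec(element: dict) -> dict:
    """Map a parsed UI element to a JSON-Schema-style tool spec."""
    # Action mapping: element type → agent action
    action = {"button": "click", "input": "type", "select": "select"}.get(
        element["type"], "click")
    # Parameter mapping: typing needs a text argument, clicking doesn't
    params = {"type": "object", "properties": {}, "required": []}
    if action == "type":
        params["properties"]["text"] = {"type": "string",
                                        "description": "Text to enter"}
        params["required"].append("text")
    return {
        "name": f'{action}_{element["id"]}',
        "description": (f'{action.capitalize()} the '
                        f'"{element.get("label", element["id"])}" element'),
        "parameters": params,
    }

spec = to_tool_spec({"id": "email", "type": "input", "label": "Email address"})
print(spec["name"])  # → type_email
```

Anything that consumes OpenAI-style function schemas can ingest a spec like this directly, which is the point of the JSON Schema escape hatch.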
What I Learned
- Accessibility tree > DOM for shadow DOM: Always use accessibility tree as primary detector. It sees shadow DOM natively.
- Vision as fallback, not primary: Vision detection is slower and less accurate. Use it to fill gaps, not as main approach.
- Functional labels > Type labels: Agents need "submit_button" not just "button". Context is everything.
- Relationship extraction enables reasoning: Knowing an input belongs to a form lets agents understand purpose.
- Standard adapters prevent lock-in: Support multiple agent frameworks from day one. Don't reinvent the wheel.
- Component heuristics improve accuracy: Knowing React/Vue/Angular patterns helps detect custom components.
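The "union + accessibility_first" merge lesson fits in a few lines. This is a sketch under assumptions: boxes are `(x1, y1, x2, y2)` tuples, and `iou`, `merge`, and the 0.5 overlap threshold are illustrative choices, not OmniParser internals.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def merge(accessibility, vision, overlap=0.5):
    """Union of both detectors; accessibility wins on conflicts."""
    merged = list(accessibility)
    for v in vision:
        # Keep a vision detection only where no accessibility box overlaps:
        # vision fills gaps instead of duplicating trusted detections.
        if all(iou(v["box"], a["box"]) < overlap for a in accessibility):
            merged.append(v)
    return merged

acc = [{"type": "button", "box": (0, 0, 100, 40)}]
vis = [{"type": "button", "box": (5, 2, 98, 38)},    # duplicate → dropped
       {"type": "input", "box": (0, 60, 200, 100)}]  # new → kept
print(len(merge(acc, vis)))  # → 2
```

Note the asymmetry: vision detections never override accessibility ones, which encodes both the "vision as fallback" and "trust the accessibility tree" lessons at once.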
Production Setup
Complete setup for agent-ready UI element extraction.
# Install OmniParser
pip install omniparser
# Install Playwright for browser automation
playwright install chromium
# For agent framework integration
pip install langchain autogen crewai
Production configuration:
from omniparser import OmniParser
from omniparser.adapters import LangChainAdapter
from langchain.agents import AgentExecutor, OpenAIFunctionsAgent
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate


class AgentWithUIUnderstanding:
    """Agent that can see and interact with UI."""

    def __init__(self):
        self.parser = OmniParser(
            detectors=["accessibility", "vision"],
            labeler="contextual",
            # Performance
            cache_parsing=True,
            cache_ttl=3600,
            # Output format
            output_format="agent_ready"
        )

    def create_agent_for_page(self, url: str):
        """Create an agent with tools from the parsed page."""
        # Parse UI elements
        elements = self.parser.parse_url(url)

        # Convert to LangChain tools
        tools = LangChainAdapter.to_tools(elements)

        # Add additional capabilities
        tools.extend([
            self._create_navigate_tool(),
            self._create_wait_tool(),
            self._create_extract_tool()
        ])

        # Create agent
        agent = OpenAIFunctionsAgent(
            llm=ChatOpenAI(model="gpt-4"),
            tools=tools,
            prompt=PromptTemplate(
                input_variables=["input"],
                template="You are a helpful agent that can interact "
                         "with web pages. User request: {input}"
            )
        )
        return AgentExecutor(agent=agent, tools=tools)

    def run(self, url: str, task: str):
        """Run an agent task on a page."""
        agent_executor = self.create_agent_for_page(url)
        result = agent_executor.invoke({"input": task})
        return result

# Usage
agent = AgentWithUIUnderstanding()
result = agent.run(
    url="https://example.com/register",
    task="Fill out the registration form with name=John, "
         "email=john@example.com and submit"
)
Monitoring & Debugging
Red Flags to Watch For
- Element detection rate < 90%: Missing interactive elements. Check shadow DOM handling.
- Misclassification rate > 15%: Buttons labeled as inputs, etc. Improve semantic labeling.
- Parsing time > 5 seconds: Too slow for real-time. Consider caching or lighter detection.
- Agent tool rejection rate > 30%: Tools not matching agent expectations. Review adapter format.
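The first red flag is easy to track continuously if you keep a small hand-labeled ground-truth set per test page. A minimal sketch, assuming you track elements by ID (the helper and sample IDs are illustrative):

```python
def detection_rate(detected_ids, ground_truth_ids):
    """Fraction of expected interactive elements actually detected."""
    truth = set(ground_truth_ids)
    return len(truth & set(detected_ids)) / len(truth) if truth else 1.0

rate = detection_rate(
    detected_ids=["login_btn", "email_input", "password_input"],
    ground_truth_ids=["login_btn", "email_input", "password_input",
                      "forgot_password_link"],
)
print(f"{rate:.0%}")  # → 75%
assert rate < 0.90  # below the red-flag threshold: check shadow DOM handling
```

The elements your ground truth lists but the parser misses are exactly the ones to inspect for shadow DOM or web-component issues.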
Debug Commands
# Test element detection
omniparser test-detection \
--url https://example.com \
--show-missed-elements
# Analyze semantic labeling
omniparser analyze-semantics \
--page ./page_elements.json \
--check-label-quality
# Benchmark parsing speed
omniparser benchmark \
--urls ./test_urls.txt \
--measure-latency,accuracy