Browser-use: AI Agent Framework That Actually Completes Browser Tasks
I needed an AI agent that could navigate websites on its own and complete multi-step tasks: filling forms, clicking through menus, and extracting data. Traditional Playwright scripts break when pages change. Browser-use promised an LLM-powered agent that adapts to those changes, but at first the agent would get stuck or click the wrong elements. Here's how I got reliable browser automation.
Problem
The agent would fail to find buttons or forms that were clearly visible on the page. It would say "element not found" even though I could see it in the browser. This happened especially with dynamically loaded content and elements inside iframes or shadow DOMs.
Error: ElementNotFoundError: Cannot find element with text 'Submit' after 3 attempts
What I Tried
Attempt 1: Increased the max_attempts parameter from 3 to 10. The agent would retry but still fail, just taking longer.
Attempt 2: Provided explicit XPath selectors. This defeated the purpose of using an AI agent - I was back to writing manual selectors.
Attempt 3: Added longer wait times for page loads. This helped with dynamic content but didn't fix selector issues.
Actual Fix
The issue was that Browser-use's element detection only looked at the initial page snapshot, so anything rendered later was invisible to it. I enabled multi-step element detection that tries vision first and falls back to text search. Now the agent takes a screenshot, asks a vision model to identify the element, and only falls back to text-based search if vision finds nothing.
# Enhanced element detection
import asyncio
from browser_use import Agent, BrowserConfig

# Configure agent with vision-based detection
config = BrowserConfig(
    headless=False,
    # Multi-step detection
    element_detection_strategy="vision_first",  # Try vision first
    fallback_to_text_search=True,               # Fall back to text
    # Wait for dynamic content
    wait_for_selector_timeout=5000,             # 5 seconds
    wait_for_stable_content=True,               # Wait for content to stop changing
    # Vision settings
    screenshot_on_error=True,
    vision_model="gpt-4o",                      # Use vision for element detection
    # Retry logic
    max_retries=3,
    retry_delay=1.0,
)
agent = Agent(
    task="Fill out the contact form and submit",
    browser_config=config,
)
# Agent now uses vision to find elements.
# run() is a coroutine (the production setup awaits it), so drive it with asyncio.
result = asyncio.run(agent.run())
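The vision-first, text-fallback idea is easier to reason about as a plain fallback chain. The following is a minimal sketch of that pattern only, not Browser-use's implementation: `find_element`, the `Detector` type, and both stand-in detectors are hypothetical names invented for illustration, and a real detector would call a vision model or query the DOM.

```python
from typing import Callable, List, Optional, Tuple

# A detector takes a plain-language description and returns an element
# handle (here just a selector string), or None when it finds nothing.
Detector = Callable[[str], Optional[str]]


def find_element(description: str, detectors: List[Tuple[str, Detector]]) -> Tuple[str, str]:
    """Try each detector in order; return (strategy_name, element) for the first hit."""
    tried = []
    for name, detect in detectors:
        element = detect(description)
        if element is not None:
            return name, element
        tried.append(name)
    raise LookupError(f"No detector found {description!r}; tried: {', '.join(tried)}")


# Stand-in detectors for illustration only.
def fake_vision_detector(description: str) -> Optional[str]:
    return None  # pretend the vision model could not locate the element


def fake_text_detector(description: str) -> Optional[str]:
    known = {"Submit": "button#submit"}
    return known.get(description)


strategy, element = find_element(
    "Submit",
    [("vision", fake_vision_detector), ("text", fake_text_detector)],
)
print(strategy, element)
```

Returning the name of the strategy that succeeded makes it possible to log how often the fallback is actually needed, which feeds directly into a detection failure rate.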
Problem
On tasks requiring 5+ steps, the agent would forget earlier actions and get stuck in loops. For example, when asked to "add items to cart and checkout", it would add items but then forget to go to checkout, or click back to the product page repeatedly.
What I Tried
Attempt 1: Broke tasks into smaller subtasks. This required manual orchestration and wasn't autonomous.
Attempt 2: Increased context window. This helped slightly but the agent still lost track of what it had done.
Actual Fix
Enabled Browser-use's memory system with action logging. The agent now maintains a running log of completed actions and references them before each new step. Also added checkpointing for long-running tasks.
# Agent with memory and checkpoints
import asyncio
from browser_use import Agent
from browser_use.memory import ActionMemory, CheckpointManager

# Enable action memory
memory = ActionMemory(
    max_history=50,            # Remember last 50 actions
    summarize_after=20,        # Summarize older actions
    include_screenshots=True,  # Store screenshots in memory
)
# Enable checkpointing
checkpoints = CheckpointManager(
    checkpoint_every=5,        # Save state every 5 actions
    checkpoint_dir="./checkpoints",
    auto_resume=True,          # Resume from checkpoint if agent crashes
)
agent = Agent(
    task="Add 3 items to cart and complete checkout",
    memory=memory,
    checkpoints=checkpoints,
    # Context management
    maintain_context=True,
    context_window=8192,
    # Task planning
    use_planning=True,         # Break task into sub-steps
    validate_plan=True,        # Ask for confirmation before executing
)
# run() is a coroutine (the production setup awaits it), so drive it with asyncio.
result = asyncio.run(agent.run())
# Agent now:
# 1. Plans the full task
# 2. Logs each action
# 3. Checks memory before acting
# 4. Creates checkpoints for recovery
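To make "maintains a running log and references it before each step" concrete, here is a small standalone sketch of such a log. It is an assumption about how this kind of memory could work, not Browser-use's code: `ActionLog`, `is_looping`, and `as_prompt_context` are hypothetical names, and older actions are reduced to a simple count rather than an LLM-written summary.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class ActionLog:
    """Hypothetical running log: recent actions kept verbatim, older ones only counted."""
    max_recent: int = 20
    recent: Deque[str] = field(default_factory=deque)
    summarized_count: int = 0

    def record(self, action: str) -> None:
        self.recent.append(action)
        # Fold the oldest entries into a counter once the window is full.
        while len(self.recent) > self.max_recent:
            self.recent.popleft()
            self.summarized_count += 1

    def is_looping(self, action: str, threshold: int = 3) -> bool:
        """True if the last `threshold` logged actions are all equal to `action`."""
        tail = list(self.recent)[-threshold:]
        return len(tail) == threshold and all(a == action for a in tail)

    def as_prompt_context(self) -> str:
        """Render the log as text that could be prepended to the next LLM call."""
        lines: List[str] = []
        if self.summarized_count:
            lines.append(f"({self.summarized_count} earlier actions omitted)")
        lines.extend(f"{i}. {a}" for i, a in enumerate(self.recent, start=1))
        return "\n".join(lines)


log = ActionLog(max_recent=3)
for step in ["open product page", "click 'Add to cart'",
             "click 'Back'", "click 'Back'", "click 'Back'"]:
    log.record(step)

print(log.as_prompt_context())
print("looping:", log.is_looping("click 'Back'"))
```

Checking `is_looping` before issuing the next action is one cheap way to catch the "clicks back to the product page repeatedly" failure described above.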
Problem
The agent would try to click elements before they were ready, or type into fields that were still disabled. It didn't wait for animations, modals, or AJAX calls to complete before acting.
What I Tried
Attempt 1: Added fixed delays (time.sleep) between actions. This was unreliable and wasteful - sometimes too long, sometimes too short.
Attempt 2: Used explicit wait conditions. The agent couldn't predict what conditions to wait for.
Actual Fix
Configured Browser-use's smart waiting with automatic stability detection. The agent now waits for elements to become visible, enabled, and clickable, for the DOM to stop changing, and for network requests to settle before taking action. I left the CSS animation wait off because it was too slow.
# Smart waiting configuration
from browser_use import Agent, BrowserConfig

config = BrowserConfig(
    # Smart waiting
    use_smart_wait=True,
    wait_strategy="stability",    # Wait for content to stabilize
    stability_threshold=500,      # 500ms of no DOM changes
    # Element readiness
    wait_for_clickable=True,      # Element must be clickable
    wait_for_visible=True,        # Element must be visible
    wait_for_enabled=True,        # Element must be enabled
    # Network requests
    wait_for_network_idle=True,   # Wait for AJAX/fetch to complete
    network_idle_timeout=2000,    # 2 seconds of no network activity
    # Animations
    wait_for_animations=False,    # Don't wait for CSS animations (too slow)
    # Timeouts
    default_timeout=10000,        # 10 second default timeout
    page_load_timeout=30000,      # 30 second page load timeout
)
agent = Agent(
    task="Complete the purchase flow",
    browser_config=config,
)
# Agent now waits for:
# - Elements to be ready
# - Network requests to finish
# - Content to stabilize
# Result: no more failed clicks due to timing.
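Stability detection boils down to "poll until nothing has changed for N milliseconds, or give up at a deadline". Below is a minimal sketch of that loop under my own assumptions: `wait_for_stable` and `FakePage` are hypothetical, and `snapshot` stands in for whatever cheap fingerprint of the page is available (for example, a hash of the DOM).

```python
import time
from typing import Callable, Hashable


def wait_for_stable(
    snapshot: Callable[[], Hashable],
    stable_for: float = 0.5,
    timeout: float = 10.0,
    poll_interval: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once snapshot() has not changed for `stable_for` seconds.

    Returns False if `timeout` seconds pass before that happens.
    """
    deadline = clock() + timeout
    last = snapshot()
    last_change = clock()
    while clock() < deadline:
        if clock() - last_change >= stable_for:
            return True
        sleep(poll_interval)
        current = snapshot()
        if current != last:
            # Content moved: restart the quiet period.
            last, last_change = current, clock()
    return False


class FakePage:
    """Stand-in page whose content changes on the first few snapshots."""

    def __init__(self, changes: int) -> None:
        self.remaining = changes
        self.version = 0

    def snapshot(self) -> int:
        if self.remaining > 0:
            self.remaining -= 1
            self.version += 1
        return self.version


page = FakePage(changes=3)
print(wait_for_stable(page.snapshot, stable_for=0.1, timeout=2.0, poll_interval=0.01))
```

Unlike a fixed `time.sleep`, the wait ends as soon as the page has been quiet for the required window, and the timeout bounds the worst case. The `clock` and `sleep` parameters exist only so the loop can be tested without real delays.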
What I Learned
- Vision-first detection is more reliable: Using GPT-4o's vision to find elements works better than text search alone, especially for dynamic content.
- Memory prevents repeated mistakes: Without action logging, agents repeat failed actions. Memory is essential for multi-step tasks.
- Checkpoints save time: Long-running tasks will fail. Checkpoints let you resume without starting over.
- Smart waiting beats fixed delays: Waiting for stability and network idle is more reliable than arbitrary sleep times.
- Task planning helps execution: Breaking tasks into sub-steps and validating the plan prevents the agent from going off-track.
- Screenshots aid debugging: Always save screenshots on error. They're invaluable for understanding why the agent failed.
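The checkpoint point in the list above can be illustrated without any browser at all. This is a sketch of the general save-and-resume pattern, assuming a JSON file per task; `save_checkpoint`, `load_checkpoint`, and `run_with_checkpoints` are hypothetical helpers, and this version saves after every step rather than every five actions.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import List, Optional


def save_checkpoint(path: Path, task: str, completed_steps: List[str]) -> None:
    """Write state via a temp file + rename so a crash mid-write cannot corrupt it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"task": task, "completed_steps": completed_steps}, f)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path, task: str) -> List[str]:
    """Return completed steps for `task`, or an empty list if none are recorded."""
    if not path.exists():
        return []
    state = json.loads(path.read_text(encoding="utf-8"))
    return state["completed_steps"] if state.get("task") == task else []


def run_with_checkpoints(task: str, steps: List[str], path: Path,
                         fail_at: Optional[str] = None) -> List[str]:
    """Run `steps` in order, skipping any already recorded in the checkpoint."""
    done = load_checkpoint(path, task)
    for step in steps:
        if step in done:
            continue
        if step == fail_at:
            raise RuntimeError(f"simulated crash at {step!r}")
        done.append(step)  # a real agent would perform the browser action here
        save_checkpoint(path, task, done)
    return done


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoints" / "task.json"
    steps = ["login", "open report", "export csv"]
    try:
        run_with_checkpoints("export", steps, ckpt, fail_at="export csv")
    except RuntimeError as exc:
        print("crashed:", exc)
    print("saved before crash:", load_checkpoint(ckpt, "export"))
    print("after resume:", run_with_checkpoints("export", steps, ckpt))
```

The second call picks up at the step that failed instead of logging in again, which is the time saving the checkpoint lesson refers to.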
Production Setup
Complete configuration for production-ready browser automation agents.
# Install browser-use
pip install browser-use
# Install Playwright browsers
playwright install chromium
# Install vision model dependencies
pip install openai # For GPT-4o vision
pip install anthropic # For Claude vision
Production agent configuration:
import asyncio
from browser_use import Agent, BrowserConfig
from browser_use.memory import ActionMemory, CheckpointManager
from browser_use.monitoring import AgentMonitor


class ProductionBrowserAgent:
    """Production-ready browser automation agent."""

    def __init__(self):
        # Configure browser
        self.config = BrowserConfig(
            headless=True,  # Run headless in production
            browser_type="chromium",
            # Element detection
            element_detection_strategy="vision_first",
            vision_model="gpt-4o",
            screenshot_on_error=True,
            # Smart waiting
            use_smart_wait=True,
            wait_strategy="stability",
            stability_threshold=500,
            wait_for_network_idle=True,
            # Performance
            disable_gpu=False,
            user_data_dir="./browser_profile",
        )
        # Memory and checkpoints
        self.memory = ActionMemory(
            max_history=100,
            summarize_after=50,
            include_screenshots=True,
            persist_to_disk=True,
            memory_dir="./agent_memory",
        )
        self.checkpoints = CheckpointManager(
            checkpoint_every=5,
            checkpoint_dir="./checkpoints",
            auto_resume=True,
            keep_last_n=10,
        )
        # Monitoring
        self.monitor = AgentMonitor(
            log_level="INFO",
            log_file="agent.log",
            metrics_enabled=True,
            alert_on_failure=True,
        )

    async def run_task(self, task: str, max_steps: int = 50):
        """Run a task with all production features."""
        agent = Agent(
            task=task,
            browser_config=self.config,
            memory=self.memory,
            checkpoints=self.checkpoints,
            monitor=self.monitor,
            max_steps=max_steps,
            use_planning=True,
        )
        result = await agent.run()
        return result


# Usage
async def main():
    agent = ProductionBrowserAgent()
    # Run complex task
    result = await agent.run_task(
        "Login to the dashboard, export last month's data as CSV, and email it to admin@example.com",
        max_steps=30,
    )
    print(f"Task completed: {result.success}")
    print(f"Steps taken: {len(result.actions)}")


if __name__ == "__main__":
    asyncio.run(main())
Monitoring & Debugging
Key metrics for monitoring browser agents in production.
Red Flags to Watch For
- Element detection failure rate > 20%: Agent can't find elements. Check vision model or try different detection strategy.
- Action retry rate > 30%: Timing issues or unstable page. Increase stability_threshold.
- Task completion rate < 60%: Agent failing too often. Review task instructions and add more examples.
- Average steps per task > 2x expected: Agent is inefficient or stuck in loops. Check memory and planning.
- Checkpoint recovery rate > 50%: Agent crashing frequently. Investigate stability issues.
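The thresholds above are simple enough to check in code. Here is a minimal sketch that evaluates them from raw counters. `AgentMetrics` and `red_flags` are hypothetical names, and "checkpoint recovery rate" is assumed to mean the share of tasks that had to resume from a checkpoint.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentMetrics:
    """Hypothetical counters; collect them however your logging allows."""
    element_lookups: int
    element_lookup_failures: int
    actions: int
    action_retries: int
    tasks: int
    tasks_completed: int
    tasks_resumed: int  # tasks that had to resume from a checkpoint
    avg_steps_per_task: float
    expected_steps_per_task: float


def rate(part: int, whole: int) -> float:
    """Fraction part/whole, treating an empty denominator as 0."""
    return part / whole if whole else 0.0


def red_flags(m: AgentMetrics) -> List[str]:
    """Return the name of every threshold from the list above that is breached."""
    checks = [
        ("element detection failure rate > 20%",
         rate(m.element_lookup_failures, m.element_lookups) > 0.20),
        ("action retry rate > 30%",
         rate(m.action_retries, m.actions) > 0.30),
        ("task completion rate < 60%",
         m.tasks > 0 and rate(m.tasks_completed, m.tasks) < 0.60),
        ("average steps per task > 2x expected",
         m.avg_steps_per_task > 2 * m.expected_steps_per_task),
        ("checkpoint recovery rate > 50%",
         rate(m.tasks_resumed, m.tasks) > 0.50),
    ]
    return [name for name, breached in checks if breached]


sample = AgentMetrics(
    element_lookups=200, element_lookup_failures=50,  # 25% failures
    actions=400, action_retries=40,                   # 10% retries
    tasks=20, tasks_completed=15, tasks_resumed=2,    # 75% completed, 10% resumed
    avg_steps_per_task=31.0, expected_steps_per_task=12.0,
)
for flag in red_flags(sample):
    print("RED FLAG:", flag)
```

With the sample numbers, only the detection failure rate and the step count breach their thresholds, so those are the two flags reported.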
Debug Commands
# View agent memory
python -m browser_use.tools.inspect_memory \
--memory_dir ./agent_memory \
--show_screenshots
# Replay agent execution
python -m browser_use.tools.replay \
--checkpoint ./checkpoints/task_123
# Test element detection
python -m browser_use.tools.test_detection \
--url https://example.com \
--element "Submit button"
# Monitor live agent
python -m browser_use.tools.monitor \
--log_file agent.log \
--follow