Selenium Wire: Capturing Network Requests That Actually Works
I needed to scrape data that was only loaded via AJAX calls after page load. The data wasn't in the HTML - it was buried in JSON responses. Selenium Wire let me intercept and extract it.
Why Network Interception Matters
Modern web apps load data dynamically. The page source has no actual data - just JavaScript that fetches it. This is where Selenium Wire shines:
- AJAX endpoints: Data loaded via XHR/fetch after page load
- GraphQL APIs: Single endpoint with POST requests
- WebSocket streams: Real-time data pushed to client
- Lazy-loaded content: Data loaded on scroll/interaction
- Authenticated APIs: Calls with auth tokens in headers
Regular Selenium can't see these requests. Selenium Wire extends Selenium to capture everything.
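Here's the core idea in a minimal sketch (the URL is a placeholder): Selenium Wire exposes every request the browser made through driver.requests.
from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL

# Every request the page made (XHR, fetch, scripts, images) is captured
for request in driver.requests:
    if request.response:
        print(request.method, request.url, request.response.status_code)

driver.quit()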
Problem: Page Timeouts on Delayed AJAX Requests
When using Selenium Wire, pages would time out before all requests completed. The standard page-load wait didn't account for delayed AJAX requests that fired seconds after the initial load.
Error: TimeoutException: Request timeout after 30 seconds
What I Tried
Attempt 1: Increased request_timeout to 120s - Just delayed the inevitable
Attempt 2: Used time.sleep(30) - Worked but wasteful and unreliable
Attempt 3: Checked for specific request patterns - Too complex for every site
Actual Fix
The solution is to wait for specific requests to complete rather than all requests. Here's a robust implementation:
from seleniumwire import webdriver
from selenium.common.exceptions import TimeoutException
import json
import time

# Configure Selenium Wire with reasonable timeouts
options = {
    'request_timeout': 60,               # Per-request timeout
    'connection_timeout': 30,            # Connection timeout
    'suppress_connection_errors': True,  # Don't fail on connection errors
}

driver = webdriver.Chrome(seleniumwire_options=options)

def wait_for_api_requests(driver, url_pattern: str, timeout: int = 30):
    """
    Wait for API requests matching a pattern to complete

    Args:
        driver: Selenium Wire driver
        url_pattern: String pattern to match in URL
        timeout: Maximum wait time in seconds
    """
    start_time = time.time()
    while time.time() - start_time < timeout:
        # Rebuild the list each pass so requests aren't counted twice
        matching_requests = [
            request for request in driver.requests
            if url_pattern in request.url and request.response
        ]
        if matching_requests:
            return matching_requests
        time.sleep(0.5)
    raise TimeoutException(f"No matching requests found for pattern: {url_pattern}")

# Usage
driver.get('https://example.com/dynamic-page')

# Wait for specific API endpoint to be called
api_requests = wait_for_api_requests(driver, '/api/products')

# Extract data from responses
for request in api_requests:
    if request.response:
        # Response bodies are raw bytes in Selenium Wire
        data = json.loads(request.response.body.decode('utf-8'))
        print(f"Got data: {len(data.get('items', []))} items")

driver.quit()
Problem: Unbounded Memory Growth Across Pages
When scraping many pages, memory usage kept growing. Selenium Wire stores all requests/responses in driver.requests by default, which accumulated to gigabytes.
What I Tried
Attempt 1: Called del driver.requests - Didn't free memory properly
Attempt 2: Restarted driver every N pages - Added overhead and delays
Attempt 3: Set driver.requests = [] - Helped but still leaked
Actual Fix
Use driver.clear_requests() after each page and configure request storage to keep only what you need:
from seleniumwire import webdriver

# Configure to only keep request metadata, not bodies
options = {
    'request_storage': 'memory',    # or 'redis' for large scale
    'request_max_body_size': 1024,  # Limit body size to 1KB
    'har_limit': 0,                 # Don't generate HAR (saves memory)
}

driver = webdriver.Chrome(seleniumwire_options=options)

def scrape_page(url: str):
    """
    Scrape a single page and clean up
    """
    driver.get(url)
    # Extract the data you need immediately
    data = []
    for request in driver.requests:
        if request.response and '/api/' in request.url:
            try:
                body = request.response.body
                # Parse and store only what you need
                data.append({
                    'url': request.url,
                    'status': request.response.status_code,
                    'body_size': len(body) if body else 0
                })
            except Exception:
                pass
    # Clear requests to free memory
    driver.clear_requests()
    return data

# Scrape multiple pages without memory growth
url_list = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs
for url in url_list:
    scrape_page(url)

driver.quit()
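A complementary tactic is to keep unwanted requests from being stored at all. A short sketch using Selenium Wire's exclude_hosts and request_storage_max_size options (the host names are placeholder assumptions):
from seleniumwire import webdriver

options = {
    # Don't capture these hosts at all (tracking/analytics noise)
    'exclude_hosts': ['www.google-analytics.com', 'connect.facebook.net'],
    # Keep at most the 100 most recent requests; older ones are dropped
    'request_storage_max_size': 100,
}
driver = webdriver.Chrome(seleniumwire_options=options)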
Problem: SSL Certificate Errors on Strict Sites
Some sites with strict SSL/TLS configurations would fail with certificate errors. Selenium Wire's proxy wasn't handling these certificates properly.
Error: CERTIFICATE_VERIFY_FAILED
What I Tried
Attempt 1: Added --ignore-certificate-errors to Chrome options - Didn't work with Selenium Wire's proxy
Attempt 2: Disabled SSL verification - Still failed on some sites
Actual Fix
Configure Selenium Wire's SSL verification properly and use the correct proxy settings:
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--allow-running-insecure-content')
chrome_options.add_argument('--disable-extensions')

# Configure Selenium Wire SSL options
options = {
    'verify_ssl': False,  # Disable SSL verification (use with caution)
    'proxy': {
        'https': 'https://localhost:443',  # Use localhost proxy
        'no_proxy': ['localhost', '127.0.0.1']
    }
}

driver = webdriver.Chrome(
    options=chrome_options,
    seleniumwire_options=options
)

# Now it should work with problematic SSL certificates
driver.get('https://strict-ssl.example.com')
driver.quit()
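When a particular site keeps failing, it helps to look at the certificate the server actually presented. A small sketch using the request.cert attribute (run it before driver.quit(); the dictionary keys can vary by version, so they're read defensively with .get()):
# Inspect the captured server certificate for HTTPS requests
# (request.cert is empty for plain-HTTP requests)
for request in driver.requests:
    if request.url.startswith('https://') and request.cert:
        print(request.url)
        print('  issuer: ', request.cert.get('issuer'))
        print('  expired:', request.cert.get('expired'))
        break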
What I Learned
- Lesson 1: Don't wait for all requests - wait for the specific ones you need. Pages make tons of tracking/analytics requests you don't care about.
- Lesson 2: Memory management is critical - always clear requests after processing.
- Lesson 3: Request filtering by URL pattern is more efficient than post-processing.
- Lesson 4: Use driver.wait_for_request() for simple cases and custom wait logic for complex scenarios (see the sketch below).
- Overall: Selenium Wire is powerful but needs careful configuration for production scraping.
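To illustrate Lessons 3 and 4 together, a minimal sketch of Selenium Wire's built-in wait_for_request() helper and its scopes capture filter; the URL and patterns are placeholders:
from seleniumwire import webdriver

driver = webdriver.Chrome()

# Lesson 3: capture only URLs matching these regexes; nothing else is stored
driver.scopes = [r'.*/api/.*']

driver.get('https://example.com/dynamic-page')  # placeholder URL

# Lesson 4: block until a matching request gets a response,
# or raise TimeoutException after 30 seconds
request = driver.wait_for_request('/api/products', timeout=30)
print(request.response.status_code)

driver.quit()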
Real-World Examples
Example 1: GraphQL API Interception
from seleniumwire import webdriver
import json
import time

def scrape_graphql_api(url: str):
    """
    Intercept GraphQL API calls
    """
    driver = webdriver.Chrome()
    driver.get(url)

    # Crude wait for the page to fire its GraphQL calls;
    # use wait_for_api_requests() above for something more robust
    time.sleep(5)

    # Collect completed GraphQL requests
    graphql_requests = [
        request for request in driver.requests
        if request.response and '/graphql' in request.url
    ]

    for request in graphql_requests:
        # GraphQL queries are POSTs with a JSON body
        if request.method == 'POST':
            try:
                # Parse request body
                request_data = json.loads(request.body.decode())
                query = request_data.get('query', '')
                # Parse response body (raw bytes)
                response_data = json.loads(request.response.body.decode('utf-8'))
                print(f"GraphQL Query: {query[:100]}...")
                print(f"Response: {json.dumps(response_data, indent=2)}")
            except Exception as e:
                print(f"Error parsing GraphQL: {e}")

    driver.quit()

# Usage
scrape_graphql_api('https://example.com/graphql-app')
Example 2: Capture Authentication Tokens
def extract_auth_tokens(driver):
    """
    Extract JWT tokens from request headers
    """
    tokens = {}
    for request in driver.requests:
        if request.headers.get('Authorization'):
            # Extract Bearer token
            auth_header = request.headers['Authorization']
            if auth_header.startswith('Bearer '):
                tokens['bearer_token'] = auth_header[7:]
        # Check the Cookie header (a single 'name=value; name2=value2' string)
        cookie_header = request.headers.get('Cookie', '')
        for cookie in cookie_header.split('; '):
            if '=' in cookie:
                name, value = cookie.split('=', 1)
                if 'auth' in name.lower() or 'token' in name.lower():
                    tokens[name] = value
    return tokens

# Usage
driver.get('https://example.com/login')
# ... perform login ...
tokens = extract_auth_tokens(driver)
print(f"Found tokens: {list(tokens.keys())}")
Example 3: WebSocket Message Capture
from seleniumwire import webdriver
import json
import time

def capture_websocket_messages(url: str):
    """
    Capture WebSocket messages
    Note: Selenium Wire has limited WebSocket support
    """
    driver = webdriver.Chrome()
    driver.get(url)

    # Wait for WebSocket connections to open and exchange messages
    time.sleep(5)

    # Messages hang off the handshake request (ws:// or wss:// URL)
    for request in driver.requests:
        if request.ws_messages:
            for message in request.ws_messages:
                print(f"WebSocket: {message}")

    driver.quit()

# Alternative: Use Chrome DevTools Protocol performance logs
def capture_websocket_with_devtools(url: str):
    """
    Capture WebSocket frames using Chrome DevTools Protocol
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(5)  # let the socket exchange some frames

    # Get performance logs
    logs = driver.get_log('performance')
    for entry in logs:
        log = json.loads(entry['message'])
        message = log.get('message', {})
        if 'Network.webSocketFrame' in message.get('method', ''):
            print(f"WebSocket frame: {message}")

    driver.quit()
Production Setup That Works
# selenium_wire_scraper.py - Production configuration
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json
import time
import logging
from typing import List, Dict, Optional
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class SeleniumWireScraper:
    """
    Production scraper using Selenium Wire with:
    - Memory management
    - Request filtering
    - Error handling
    - Configurable timeouts
    """

    def __init__(
        self,
        headless: bool = True,
        request_timeout: int = 60,
        enable_har: bool = False
    ):
        """
        Args:
            headless: Run headless browser
            request_timeout: Per-request timeout in seconds
            enable_har: Enable HAR generation (memory intensive)
        """
        self.request_timeout = request_timeout

        # Chrome options
        chrome_options = Options()
        if headless:
            chrome_options.add_argument('--headless=new')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--window-size=1920,1080')
        chrome_options.add_argument('--disable-blink-features=AutomationControlled')
        chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])

        # Selenium Wire options
        wire_options = {
            'request_timeout': request_timeout,
            'connection_timeout': 30,
            'suppress_connection_errors': True,
            'verify_ssl': False,
            'request_storage': 'memory',
            'request_max_body_size': 2048,  # 2KB limit for bodies
        }

        # Only enable HAR if needed (memory intensive)
        if enable_har:
            wire_options['enable_har'] = True

        self.driver = webdriver.Chrome(
            options=chrome_options,
            seleniumwire_options=wire_options
        )
    def scrape_api_data(
        self,
        url: str,
        api_pattern: str = '/api/',
        wait_selector: Optional[str] = None,
        scroll_count: int = 0
    ) -> List[Dict]:
        """
        Scrape page and extract API response data

        Args:
            url: Page URL to scrape
            api_pattern: URL pattern to match API requests
            wait_selector: CSS selector to wait for before scraping
            scroll_count: Number of times to scroll down (for lazy loading)

        Returns:
            List of captured API responses
        """
        try:
            logger.info(f"Navigating to {url}")
            self.driver.get(url)

            # Wait for page element if specified
            if wait_selector:
                WebDriverWait(self.driver, 15).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
                )

            # Scroll to trigger lazy loading
            for _ in range(scroll_count):
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(1)

            # Wait for API requests
            api_data = self._wait_for_api_requests(api_pattern)
            logger.info(f"Captured {len(api_data)} API responses")
            return api_data
        except Exception as e:
            logger.error(f"Error scraping {url}: {e}")
            return []
        finally:
            # Clean up requests to free memory
            self.driver.clear_requests()
    def _wait_for_api_requests(
        self,
        url_pattern: str,
        timeout: int = 30
    ) -> List[Dict]:
        """
        Wait for API requests matching pattern

        Args:
            url_pattern: String to match in URL
            timeout: Maximum wait time

        Returns:
            List of response data
        """
        start_time = time.time()
        captured_data = []
        while time.time() - start_time < timeout:
            # Check for matching requests
            for request in self.driver.requests:
                if url_pattern in request.url and request.response:
                    # Extract data only once per request
                    if not any(d['url'] == request.url for d in captured_data):
                        try:
                            response_data = {
                                'url': request.url,
                                'status': request.response.status_code,
                                'headers': dict(request.response.headers),
                                'body': None
                            }
                            # Response bodies are raw bytes; try JSON first
                            body_text = request.response.body.decode('utf-8', errors='replace')
                            try:
                                response_data['body'] = json.loads(body_text)
                            except json.JSONDecodeError:
                                # Not JSON, store as text
                                response_data['body'] = body_text
                            captured_data.append(response_data)
                        except Exception as e:
                            logger.warning(f"Error parsing response: {e}")
            if captured_data:
                return captured_data
            time.sleep(0.5)
        logger.warning(f"Timeout waiting for API requests matching: {url_pattern}")
        return captured_data
    def scrape_multiple_pages(
        self,
        urls: List[str],
        api_pattern: str = '/api/',
        delay_between_pages: float = 2.0
    ) -> Dict[str, List[Dict]]:
        """
        Scrape multiple pages with memory management

        Args:
            urls: List of URLs to scrape
            api_pattern: API URL pattern to match
            delay_between_pages: Delay between pages in seconds

        Returns:
            Dict mapping URLs to captured data
        """
        results = {}
        for idx, url in enumerate(urls):
            logger.info(f"Scraping page {idx + 1}/{len(urls)}: {url}")
            data = self.scrape_api_data(url, api_pattern)
            results[url] = data
            # Delay between pages
            if idx < len(urls) - 1:
                time.sleep(delay_between_pages)
        return results

    def close(self):
        """Close the driver"""
        self.driver.quit()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
# Usage
if __name__ == "__main__":
    with SeleniumWireScraper(headless=True) as scraper:
        # Single page
        data = scraper.scrape_api_data(
            url='https://example.com/dynamic-page',
            api_pattern='/api/products',
            wait_selector='.product-list',
            scroll_count=3
        )
        for response in data:
            print(f"API: {response['url']}")
            print(f"Status: {response['status']}")
            if isinstance(response['body'], dict):
                print(f"Items: {len(response['body'].get('items', []))}")

        # Multiple pages
        urls = [
            'https://example.com/page1',
            'https://example.com/page2',
            'https://example.com/page3'
        ]
        all_data = scraper.scrape_multiple_pages(urls, api_pattern='/api/')
        for url, responses in all_data.items():
            print(f"{url}: {len(responses)} responses captured")
Monitoring & Debugging
Red Flags to Watch For
- Memory growing: You're not clearing requests between pages (see the memory check after this list)
- Timeouts: Request timeout too low or site too slow
- Missing data: Requests triggered after you stop waiting
- Certificate errors: SSL verification issues
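A quick check for the first two red flags, sketched below; it assumes the third-party psutil package, which is not part of the original setup:
import os

import psutil  # assumed dependency: pip install psutil

def log_capture_stats(driver, logger):
    """Log how many requests are held and the process's resident memory."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    logger.info("captured requests=%d, rss=%.1f MB", len(driver.requests), rss_mb)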
Debug Helper
def debug_requests(driver):
    """Print all requests for debugging"""
    print(f"Total requests: {len(driver.requests)}")
    for request in driver.requests:
        print(f"\n{request.method} {request.url}")
        if request.response:
            print(f"  Status: {request.response.status_code}")
            print(f"  Content-Type: {request.response.headers.get('Content-Type')}")
        else:
            print("  No response (pending/failed)")

# Usage
driver.get('https://example.com')
time.sleep(5)  # Wait for requests
debug_requests(driver)
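For deeper digging, the whole capture can be exported as a HAR file and opened in the browser DevTools or any HAR viewer. A minimal sketch, assuming the driver was created with 'enable_har': True in its seleniumwire_options:
# driver.har returns the captured traffic as a HAR-format JSON string
# (only populated when 'enable_har': True was set at driver creation)
with open('capture.har', 'w') as f:
    f.write(driver.har)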