I spent two weeks trying to scrape a client's React dashboard. Requests returned empty divs. Selenium got blocked immediately. Regular Playwright lasted about 20 requests before hitting a 403.
Eventually figured out the problem: modern SPAs (React, Vue, Angular) render everything client-side, and they've gotten really good at detecting automation. The content you want appears seconds after page load, and if you're using WebDriver, they know.
## The solution

Use Playwright with playwright-stealth (a Python adaptation of the puppeteer-extra stealth plugin). It patches the traces that give browser automation away. Combine it with Playwright's code recorder and you can scrape most SPAs without writing selectors by hand.
## Why SPAs break traditional scrapers

What requests/BeautifulSoup sees:

```html
<div id="app"></div>
<div id="root"></div>
<!-- content rendered by JavaScript later -->
```

What you actually want:

```html
<div class="product">iPhone 15</div>
<div class="product">Samsung S24</div>
```
The HTML source never contains the product data. It arrives via XHR calls after page load, then gets inserted into the DOM. You need a browser that executes JavaScript.
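A quick stdlib sketch of the symptom: parse the shell HTML the server actually returns and look for the product nodes. The HTML string here is invented to mirror the example above.

```python
from html.parser import HTMLParser

# What the server returns before any JavaScript runs (invented shell)
SHELL = '<html><body><div id="root"></div></body></html>'

class ProductFinder(HTMLParser):
    """Collects elements carrying class="product"."""
    def __init__(self):
        super().__init__()
        self.products = []

    def handle_starttag(self, tag, attrs):
        if 'product' in (dict(attrs).get('class') or ''):
            self.products.append(tag)

finder = ProductFinder()
finder.feed(SHELL)
print(len(finder.products))  # 0 - the data simply isn't in the source
```

No amount of parsing will find what the server never sent.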
## Installing the tools

```shell
pip install playwright playwright-stealth
playwright install chromium
```
`playwright-stealth` is a Python port of the `puppeteer-extra` stealth plugin. It's maintained separately from Playwright itself.
## Getting stealth to work

Basic setup to avoid detection:
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()

    # Patch the browser before navigating
    stealth_sync(page)

    page.goto('https://facebook.com')
    page.wait_for_selector('.data-loaded')

    content = page.inner_text('.data-loaded')
    print(content)

    browser.close()
```
Always test with `headless=False` first: you can see what's happening and verify stealth is working.
## How stealth patches the browser

The plugin modifies several things bot-detection scripts look for:

| Detection vector | What stealth does |
|---|---|
| `navigator.webdriver` | Removes the flag (sets it to `undefined`) |
| `window.chrome` | Adds the missing `chrome` object |
| `navigator.permissions` | Mocks permission query responses |
| `navigator.plugins` | Returns a fake plugin list |
| WebGL renderer | Masks GPU information |
| CDP traces | Hides Chrome DevTools Protocol indicators |
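To make the first row concrete: stealth works by injecting init scripts that run before the site's own JavaScript. A simplified, hand-rolled version of the `navigator.webdriver` patch (the JS string is my own sketch; `page.add_init_script` is Playwright's real API for registering it):

```python
# A simplified version of stealth's navigator.webdriver patch: an init
# script that runs in every page before the site's JS can check the flag.
WEBDRIVER_PATCH = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

# In a live session you'd register it on the page before goto():
# page.add_init_script(WEBDRIVER_PATCH)

print('webdriver' in WEBDRIVER_PATCH)
```

Stealth bundles a couple dozen patches like this so you don't maintain them yourself.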
## Use the code recorder

This is the killer feature. No more inspecting elements or guessing CSS selectors:

```shell
# Start recording
playwright codegen https://facebook.com
```
What happens:
- Browser opens with the site loaded
- You click around, fill forms, scroll
- Code generates in the sidebar in real-time
- Copy and paste when done
Generated output:

```python
from playwright.sync_api import Page, expect

def run(page: Page) -> None:
    page.goto("https://facebook.com/")
    page.get_by_role("button", name="Load More").click()
    page.wait_for_selector(".product-card")

    products = page.locator(".product-card").all()
    for product in products:
        print(product.inner_text())
```
### Why this matters
The recorder uses role-based selectors (button, link, heading) instead of CSS classes. These don't break when the app rebuilds and class names change.
## Handling async content

React/Vue apps load data asynchronously. There are multiple ways to wait:
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    stealth_sync(page)
    page.goto('https://facebook.com')

    # Option 1: Wait for an element to appear
    page.wait_for_selector('.loaded-content')

    # Option 2: Wait until the network goes quiet
    page.wait_for_load_state('networkidle')

    # Option 3: Wait for specific text
    page.wait_for_selector('text=Data loaded')

    # Option 4: Wait for a specific API response
    # (the pattern is a glob matched against the full URL)
    with page.expect_response('**/api/data') as response_info:
        page.click('button:has-text("Refresh")')
    response = response_info.value

    # Now scrape
    items = page.locator('.item').all()
    print(f"Found {len(items)} items")
```
I usually use networkidle + selector wait. Belt and suspenders approach.
## Selecting elements in React apps

Playwright ships experimental React selectors, and its role/text/test-id locators are far more stable than raw CSS:
```python
# By role (accessible name)
submit = page.get_by_role('button', name='Submit')

# By test ID (if developers added them)
element = page.get_by_test_id('submit-button')

# By text content
title = page.get_by_text('Welcome back')

# Combine filters
sale_items = page.get_by_role('listitem').filter(has_text='sale')

# React component selector (experimental; uses the component name
# you'd see in React DevTools)
profile = page.locator('_react=UserProfile')
```
## Dealing with infinite scroll

Most modern shops use it. Here's a pattern that works:
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
import time

def scrape_infinite_scroll(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        stealth_sync(page)
        page.goto(url)

        items = set()
        while True:
            # Wait for items to load
            page.wait_for_selector('.item')
            current = page.locator('.item').all_text_contents()

            new_items = set(current) - items
            if not new_items:
                break  # Reached the end

            items.update(new_items)
            print(f"Collected {len(items)} items...")

            # Scroll down to trigger the next batch
            page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            time.sleep(2)

        browser.close()
        return list(items)
```
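The stop condition is the part worth understanding: the loop ends when a scroll pass yields nothing new. Here it is in isolation, driven by canned "pages" of items instead of a live site:

```python
def collect(pages):
    """Accumulate items across scroll passes; stop when a pass adds nothing."""
    items = set()
    for current in pages:
        new_items = set(current) - items
        if not new_items:
            break
        items.update(new_items)
    return items

# The third pass only repeats an old item, so the loop terminates there
result = collect([['a', 'b'], ['b', 'c'], ['c']])
print(sorted(result))  # ['a', 'b', 'c']
```

One caveat: this dedupes on text content, so two genuinely different items with identical text would count as one.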
## Intercept API responses

Sometimes it's easier to grab the JSON directly:
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    stealth_sync(page)

    api_responses = []

    def handle_response(response):
        if '/api/products' in response.url:
            api_responses.append(response.json())

    page.on('response', handle_response)
    page.goto('https://facebook.com')

    # Trigger the API call
    page.click('button:has-text("Load")')
    page.wait_for_load_state('networkidle')

    # Process the JSON data directly
    for product in api_responses[0]['results']:
        print(f"{product['name']}: ${product['price']}")
```
This skips all the HTML parsing. The data arrives already structured.
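Once the JSON is captured, shaping it is ordinary Python with no selectors involved. A sketch using a made-up payload matching the `/api/products` shape assumed above:

```python
# Invented payload mirroring the assumed /api/products response shape
payload = {
    'results': [
        {'name': 'iPhone 15', 'price': 799},
        {'name': 'Samsung S24', 'price': 859},
    ]
}

rows = [(p['name'], p['price']) for p in payload['results']]
for name, price in rows:
    print(f"{name}: ${price}")
```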
## Errors I encountered

### "Element not found" even with wait_for_selector
React components render conditionally. The element might be in a loading state:
```python
# Wait for the element to be visible ('visible' implies attached;
# passing state= twice, as I first tried, is a syntax error)
page.wait_for_selector('.item', state='visible')

# Or wait for it to NOT be in a loading state
page.wait_for_selector('.content:not(.loading)')
```
### Still getting blocked

Some sites check more than WebDriver flags:
```python
# Don't use headless mode
browser = p.chromium.launch(headless=False)

# Create pages from a context with a realistic viewport and user agent
context = browser.new_context(
    viewport={'width': 1920, 'height': 1080},
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
)
page = context.new_page()
```
### Content loads but shows a loading spinner forever

SPAs often have multiple loading states:
```python
# Wait for the spinner to disappear first
page.wait_for_selector('.spinner', state='hidden', timeout=15000)

# Then wait for the actual content
page.wait_for_selector('.content', state='visible')
```
### Code recorder uses fragile selectors

It might generate CSS selectors like `div > div:nth-child(2)`. Replace them with stable ones:

```python
# Fragile (generated):
page.click('div > div:nth-child(2)')

# Stable replacements:
page.click('button:has-text("Submit")')
page.get_by_role('button', name='Submit').click()
```
## Extra stealth measures

For sites with aggressive detection:
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-infobars',
        ],
    )
    context = browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        locale='en-US',
        timezone_id='America/New_York',
    )
    page = context.new_page()
    stealth_sync(page)

    # Scroll gradually, like a human would
    page.goto('https://facebook.com')
    for i in range(5):
        page.evaluate(f'window.scrollTo(0, {i * 300})')
        page.wait_for_timeout(800)
```
## Which approach to use

Not every site needs stealth:
- requests + BeautifulSoup - Static HTML, server-side rendering
- Regular Playwright - Simple JS sites without detection
- Playwright + stealth - Bot detection, login-protected content
- Official API - When available and rate limits work for you
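If it helps to codify the list, here's a tiny chooser; the priority order is my own assumption, not a rule:

```python
def pick_scraper(js_rendered: bool, bot_detection: bool, has_api: bool) -> str:
    """Rule of thumb distilled from the list above."""
    if has_api:
        return 'official API'
    if not js_rendered:
        return 'requests + BeautifulSoup'
    if bot_detection:
        return 'Playwright + stealth'
    return 'regular Playwright'

print(pick_scraper(js_rendered=True, bot_detection=True, has_api=False))
# Playwright + stealth
```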
Writing this down because I know I'll forget the stealth configuration next time I need it. The code recorder alone is worth the setup time - no more guessing selectors or debugging why `element.click()` keeps failing.