How I got here
Used DrissionPage on and off for maybe a year. Basic stuff worked fine - navigate to page, click some buttons, extract text. Thought I had it figured out.
Then tried scraping a site with heavy JavaScript and Cloudflare. Got blocked constantly. Sessions kept expiring. Dynamic content wouldn't load. Realized there's a whole other level to this.
Spent way too much time figuring this stuff out through trial and error. Documentation exists but it's scattered. Eventually cobbled together solutions that actually work in production.
Not claiming this is the "right" way or "best practices". Just what worked for me after lots of failed attempts. If you're stuck at the basic level and need to handle real sites, maybe this helps.
Session management
Real sites need login, cookies, headers. Here's how to manage sessions properly.
Persistent sessions
```python
from DrissionPage import ChromiumPage
import pickle
import os

class SessionManager:
    def __init__(self, session_file='session.pkl'):
        self.session_file = session_file
        self.cookies = {}
        self.headers = {}

    def save_session(self, page):
        """Save cookies and headers to file"""
        self.cookies = page.cookies(as_dict=True)
        self.headers = {
            'User-Agent': page.user_agent,
            'Referer': page.url,
        }
        with open(self.session_file, 'wb') as f:
            pickle.dump({
                'cookies': self.cookies,
                'headers': self.headers,
            }, f)

    def load_session(self, page):
        """Load saved session into browser"""
        if not os.path.exists(self.session_file):
            return False
        with open(self.session_file, 'rb') as f:
            data = pickle.load(f)
        # Restore cookies (set.cookies accepts a dict of name/value pairs)
        page.set.cookies(data['cookies'])
        self.headers = data['headers']
        return True

# Usage
page = ChromiumPage()
manager = SessionManager()

# Try to load existing session
if not manager.load_session(page):
    # No session found, need to login
    page.get('https://example.com/login')
    page.ele('#username').input('myusername')
    page.ele('#password').input('mypassword')
    page.ele('#login-btn').click()
    # Wait for the post-login page to finish loading
    page.wait.doc_loaded()
    # Save session for next time
    manager.save_session(page)
```
Custom headers and auth
```python
from DrissionPage import ChromiumPage

page = ChromiumPage()

# Add authorization headers (applied to subsequent requests)
page.set.headers({
    'Authorization': 'Bearer your-token-here',
    'X-API-Key': 'your-api-key',
    'Accept': 'application/json',
})

# Set user agent to blend in with regular traffic
page.set.user_agent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/120.0.0.0 Safari/537.36'
)

# ChromiumPage.get() doesn't take per-request headers (that's SessionPage),
# so set extra headers before navigating
page.set.headers({'X-Custom-Header': 'value'})
page.get('https://api.example.com/protected')
```
Handle multiple sessions
```python
from DrissionPage import ChromiumPage, ChromiumOptions
import threading

class MultiSessionScraper:
    def __init__(self, num_sessions=3):
        self.sessions = []
        self.num_sessions = num_sessions

    def create_session(self, user_creds):
        """Create a logged-in browser instance"""
        # auto_port gives each instance its own browser; a bare ChromiumPage()
        # would attach every "session" to the same default browser
        page = ChromiumPage(ChromiumOptions().auto_port())
        # Login with credentials
        page.get('https://example.com/login')
        page.ele('#username').input(user_creds['username'])
        page.ele('#password').input(user_creds['password'])
        page.ele('#login-btn').click()
        page.wait.doc_loaded()
        return page

    def init_sessions(self, credentials_list):
        """Initialize multiple sessions"""
        threads = []
        for creds in credentials_list[:self.num_sessions]:
            # Bind creds as a default arg so each thread gets its own copy
            thread = threading.Thread(
                target=lambda c=creds: self.sessions.append(
                    self.create_session(c)
                )
            )
            threads.append(thread)
            thread.start()
        # Wait for all sessions to initialize
        for thread in threads:
            thread.join()

    def scrape_with_rotation(self, urls):
        """Rotate through sessions to avoid rate limiting"""
        results = []
        for i, url in enumerate(urls):
            # Use sessions in round-robin fashion
            session = self.sessions[i % len(self.sessions)]
            session.get(url)
            data = session.ele('.content').text
            results.append(data)
        return results
```
Handling dynamic content
Modern sites load content asynchronously. Here's how to handle it.
Wait strategies
Don't just use sleep(). Use smart waiting.
```python
from DrissionPage import ChromiumPage
import time

page = ChromiumPage()

# Strategy 1: Wait for a specific element
page.get('https://example.com/dynamic')
element = page.ele('#dynamic-content', timeout=10)
# Waits up to 10 seconds for the element to appear

# Strategy 2: Wait for page state
page.wait.load_start()  # Wait for page to start loading
page.wait.doc_loaded()  # Wait for document to complete
# (there's no built-in network-idle wait; poll for what you need instead)

# Strategy 3: Wait for a custom condition by polling
# (page.wait(n) just sleeps - it doesn't take a predicate)
def wait_for(page, condition, timeout=5, interval=0.5):
    end = time.time() + timeout
    while time.time() < end:
        if condition(page):
            return True
        time.sleep(interval)
    return False

wait_for(page, lambda p: p.ele('.data-table', timeout=0))

# Strategy 4: Wait for URL change
page.get('https://example.com/redirect')
page.wait.url_change('https://example.com/target')

# Strategy 5: Wait for an element count
wait_for(page, lambda p: len(p.eles('.item')) > 10, timeout=10)
```
Infinite scroll handling
Load all items from infinite scroll pages.
```python
def scrape_infinite_scroll(page, max_scrolls=50):
    """Scrape all items from an infinite scroll page"""
    seen_ids = set()  # Track ids to avoid duplicates
    scroll_count = 0
    previous_height = 0
    while scroll_count < max_scrolls:
        # Collect current items
        for item in page.eles('.product-card'):
            item_id = item.attr('data-id')
            if item_id and item_id not in seen_ids:
                seen_ids.add(item_id)
                # Process item here
                title = item.ele('.title').text
                price = item.ele('.price').text
                print(f"Scraped: {title} - {price}")
        # Scroll to bottom
        page.scroll.to_bottom()
        # Wait for new content to load
        page.wait(2)
        # Check if page height changed (page.html is a string,
        # so measure the height with JS instead)
        current_height = page.run_js('return document.body.scrollHeight;')
        if current_height == previous_height:
            # No new content, reached the end
            break
        previous_height = current_height
        scroll_count += 1
    return len(seen_ids)

# Usage
page.get('https://example.com/products')
total_items = scrape_infinite_scroll(page)
print(f"Total items scraped: {total_items}")
```
WebSocket interception
Capture the real-time data feeding a live page. DrissionPage's `listen` API records network traffic; depending on your version you can filter by resource type, otherwise filter packets by URL.
```python
from DrissionPage import ChromiumPage
import json

page = ChromiumPage()

# Start listening BEFORE navigating so no traffic is missed.
# res_type filtering is version-dependent; if your version lacks it,
# drop the argument and filter packets by URL instead.
page.listen.start('example.com', res_type='XHR')

# Navigate to the page with live data
page.get('https://example.com/live-data')

messages = []
# Collect packets for up to 10 seconds
for packet in page.listen.steps(timeout=10):
    try:
        data = json.loads(packet.response.body)
    except (TypeError, ValueError):
        continue
    messages.append(data)
    print(f"Received: {data}")

# Stop listening
page.listen.stop()

# Note: raw WebSocket frames aren't exposed through listen on all versions;
# if the site is pure WebSocket, you'll need CDP's Network domain directly.
```
Bypassing detection
Sites fight bots. Here's how to stay under the radar.
Browser fingerprinting
Hide automation characteristics.
```python
from DrissionPage import ChromiumPage, ChromiumOptions
import random
import time

# Launch arguments go through ChromiumOptions, not the page constructor
co = ChromiumOptions()
co.set_argument('--disable-blink-features=AutomationControlled')
co.set_argument('--disable-dev-shm-usage')
co.set_argument('--no-sandbox')
co.set_argument('--disable-gpu')
page = ChromiumPage(co)

# Remove webdriver traces (run_cdp takes the params as keyword arguments)
page.run_cdp('Page.addScriptToEvaluateOnNewDocument', source='''
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    });
    Object.defineProperty(navigator, 'plugins', {
        get: () => [1, 2, 3, 4, 5]
    });
    Object.defineProperty(navigator, 'languages', {
        get: () => ['en-US', 'en']
    });
    window.chrome = {
        runtime: {}
    };
''')

# Randomize behavior
def random_sleep(min_sec=1, max_sec=3):
    """Sleep for a random duration"""
    time.sleep(random.uniform(min_sec, max_sec))

# Mimic human typing (input() has no delay param, so feed one char at a time)
def human_type(element, text):
    """Type text like a human"""
    for char in text:
        element.input(char, clear=False)
        time.sleep(random.uniform(0.05, 0.2))

# Use it
page.get('https://example.com')
search_box = page.ele('#search')
search_box.clear()
human_type(search_box, 'search query')
random_sleep(0.5, 1.5)
search_box.input('\n', clear=False)  # newline submits like pressing Enter
```
Cloudflare bypass
Handle Cloudflare challenges gracefully.
```python
from DrissionPage import ChromiumPage, ChromiumOptions
import time

def bypass_cloudflare(page, url, max_wait=30):
    """Wait out a Cloudflare interstitial challenge"""
    page.get(url)
    start_time = time.time()
    while time.time() - start_time < max_wait:
        # The interstitial keeps the original URL but has a telltale title
        title = (page.title or '').lower()
        if 'just a moment' in title or 'checking your browser' in title:
            print("Waiting for Cloudflare challenge...")
            time.sleep(2)
            continue
        print("Cloudflare passed!")
        return True
    print("Failed to get past Cloudflare")
    return False

# Enhanced approach with a persistent browser profile
def create_stealth_browser():
    """Create a browser that looks more human"""
    co = ChromiumOptions()
    co.set_user_data_path('./browser_profile')  # Persist the profile
    co.set_argument('--disable-blink-features=AutomationControlled')
    co.set_argument('--disable-infobars')
    page = ChromiumPage(co)
    # Set a realistic viewport (don't also maximize - it would override this)
    page.set.window.size(1920, 1080)
    return page

# Usage
page = create_stealth_browser()
if bypass_cloudflare(page, 'https://protected-site.com'):
    # Proceed with scraping
    data = page.ele('.content').text
else:
    print("Could not access site")
```
Rate limiting and delays
Implement smart delays to avoid bans.
```python
import time
import random
from DrissionPage import ChromiumPage

class SmartDelayer:
    def __init__(self, min_delay=1, max_delay=3):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_action = time.time()

    def wait(self):
        """Wait out the remainder of a random delay"""
        # Time already spent since the last action counts toward the delay
        elapsed = time.time() - self.last_action
        delay = random.uniform(self.min_delay, self.max_delay)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_action = time.time()

    def human_action(self, action_func, *args, **kwargs):
        """Execute an action with a human-like delay"""
        self.wait()
        return action_func(*args, **kwargs)

# Usage
page = ChromiumPage()
delayer = SmartDelayer(min_delay=2, max_delay=5)

urls = ['https://example.com/1', 'https://example.com/2', 'https://example.com/3']
for url in urls:
    delayer.human_action(page.get, url)
    # Process page
    data = page.ele('.content').text
    # Random longer pause between pages
    time.sleep(random.uniform(10, 20))
```
Concurrent scraping
Speed up scraping with multiple browsers.
Thread-based parallel scraping
```python
from DrissionPage import ChromiumPage, ChromiumOptions
import threading
import queue

class WorkerThread(threading.Thread):
    def __init__(self, task_queue, result_queue):
        super().__init__()
        self.task_queue = task_queue
        self.result_queue = result_queue
        self.page = None

    def run(self):
        """Worker thread that processes tasks"""
        # Each worker gets its own browser on its own port
        self.page = ChromiumPage(ChromiumOptions().auto_port())
        while True:
            try:
                # Get task from queue
                task = self.task_queue.get(timeout=5)
                if task is None:  # Poison pill
                    break
                # Process task
                url, action = task
                try:
                    result = self.process_task(url, action)
                    self.result_queue.put(result)
                except Exception as e:
                    self.result_queue.put({'error': str(e), 'url': url})
                self.task_queue.task_done()
            except queue.Empty:
                continue
        # Cleanup
        if self.page:
            self.page.quit()

    def process_task(self, url, action):
        """Process a single scraping task"""
        self.page.get(url)
        if action == 'scrape':
            return {
                'url': url,
                'data': self.page.ele('.content').text,
            }
        elif action == 'screenshot':
            return {
                'url': url,
                'screenshot': self.page.get_screenshot(as_bytes=True),
            }

def parallel_scrape(urls, num_workers=3):
    """Scrape multiple URLs in parallel"""
    # Create queues
    task_queue = queue.Queue()
    result_queue = queue.Queue()
    # Add tasks to the queue
    for url in urls:
        task_queue.put((url, 'scrape'))
    # Add one poison pill per worker
    for _ in range(num_workers):
        task_queue.put(None)
    # Create and start workers
    workers = []
    for _ in range(num_workers):
        worker = WorkerThread(task_queue, result_queue)
        worker.start()
        workers.append(worker)
    # Wait for completion
    for worker in workers:
        worker.join()
    # Collect results
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    return results

# Usage
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    'https://example.com/page4',
    'https://example.com/page5',
]
results = parallel_scrape(urls, num_workers=3)
print(f"Scraped {len(results)} pages")
```
Resource management
```python
from DrissionPage import ChromiumPage, ChromiumOptions
import threading

class BrowserPool:
    """Pool of browser instances for efficient resource usage"""
    def __init__(self, max_size=5):
        self.pool = []
        self.max_size = max_size
        self.lock = threading.Lock()

    def acquire(self):
        """Get a browser from the pool"""
        with self.lock:
            if self.pool:
                return self.pool.pop()
        # Create a new browser (own port so instances don't collide)
        return ChromiumPage(ChromiumOptions().auto_port())

    def release(self, browser):
        """Return a browser to the pool"""
        with self.lock:
            if len(self.pool) < self.max_size:
                # Clear cookies before reuse
                browser.set.cookies.clear()
                self.pool.append(browser)
                return
        # Pool full, close the browser
        browser.quit()

    def cleanup(self):
        """Close all browsers in the pool"""
        with self.lock:
            for browser in self.pool:
                browser.quit()
            self.pool.clear()

# Usage
pool = BrowserPool(max_size=3)
browser = pool.acquire()
try:
    browser.get('https://example.com')
    # Do work
    data = browser.ele('.content').text
    pool.release(browser)
finally:
    pool.cleanup()
```
Production-ready error handling
Real scrapers fail. Handle failures gracefully.
Retry with exponential backoff
```python
import time
from functools import wraps

def retry(max_attempts=3, base_delay=1, max_delay=60):
    """Decorator for retry with exponential backoff"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            attempt = 0
            while attempt < max_attempts:
                try:
                    return func(*args, **kwargs)
                except Exception:
                    attempt += 1
                    if attempt >= max_attempts:
                        raise
                    # Exponential backoff, capped at max_delay
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    print(f"Attempt {attempt} failed, retrying in {delay}s...")
                    time.sleep(delay)
        return wrapper
    return decorator

# Usage
@retry(max_attempts=3, base_delay=2)
def scrape_page(page, url):
    page.get(url)
    return page.ele('.content').text

# Try scraping
try:
    data = scrape_page(page, 'https://example.com')
except Exception as e:
    print(f"Failed after retries: {e}")
```
Comprehensive error handling
```python
from DrissionPage import ChromiumPage
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class RobustScraper:
    def __init__(self):
        self.page = None
        self.max_retries = 3

    def init_browser(self):
        """Initialize browser with error handling"""
        try:
            self.page = ChromiumPage()
            logger.info("Browser initialized successfully")
            return True
        except Exception as e:
            logger.error(f"Failed to initialize browser: {e}")
            return False

    def safe_scrape(self, url, selector):
        """Scrape with comprehensive error handling"""
        if not self.page:
            if not self.init_browser():
                return None
        for attempt in range(self.max_retries):
            try:
                # Navigate with timeout
                self.page.get(url, timeout=30)
                # Find element
                element = self.page.ele(selector, timeout=10)
                if not element:
                    logger.warning(f"Element not found: {selector}")
                    return None
                # Extract data
                return element.text
            except TimeoutError:
                logger.warning(f"Timeout on attempt {attempt + 1}")
                if attempt < self.max_retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff
            except Exception as e:
                logger.error(f"Error on attempt {attempt + 1}: {e}")
                if attempt == self.max_retries - 1:
                    # Last attempt failed, try restarting the browser
                    self.restart_browser()
        return None

    def restart_browser(self):
        """Restart browser to recover from errors"""
        logger.info("Restarting browser...")
        try:
            if self.page:
                self.page.quit()
            self.init_browser()
        except Exception as e:
            logger.error(f"Failed to restart browser: {e}")

    def cleanup(self):
        """Cleanup resources"""
        if self.page:
            self.page.quit()

# Usage
scraper = RobustScraper()
try:
    data = scraper.safe_scrape('https://example.com', '.main-content')
    if data:
        print(f"Scraped: {data}")
finally:
    scraper.cleanup()
```
Production patterns
Real project patterns I use.
Configuration management
```python
import yaml

class Config:
    """Centralized configuration management"""
    def __init__(self, config_file='config.yaml'):
        with open(config_file) as f:
            self.config = yaml.safe_load(f)

    def get(self, key, default=None):
        """Get a dotted-path configuration value"""
        value = self.config
        for k in key.split('.'):
            if not isinstance(value, dict):
                return default
            value = value.get(k)
            if value is None:
                return default
        return value
```

```yaml
# config.yaml
scraping:
  concurrent_browsers: 3
  page_timeout: 30
  retry_attempts: 3
  delay_between_requests: 2

proxies:
  - type: http
    host: proxy1.example.com
    port: 8080
  - type: socks5
    host: proxy2.example.com
    port: 1080

logging:
  level: INFO
  file: scraper.log
```

```python
# Usage
config = Config()
timeout = config.get('scraping.page_timeout', 30)
proxies = config.get('proxies', [])
```
Data pipeline integration
```python
import csv
import json
from datetime import datetime

class DataPipeline:
    """Handle data extraction and storage"""
    def __init__(self, output_format='json'):
        self.output_format = output_format
        self.data = []

    def extract_data(self, page, schema):
        """Extract structured data based on a schema"""
        record = {}
        for field, selector in schema.items():
            # timeout=0 so a missing field doesn't stall the whole record
            element = page.ele(selector, timeout=0)
            record[field] = element.text if element else None
        record['scraped_at'] = datetime.now().isoformat()
        self.data.append(record)
        return record

    def save(self, filename):
        """Save data to file"""
        if self.output_format == 'json':
            with open(filename, 'w') as f:
                json.dump(self.data, f, indent=2)
        elif self.output_format == 'csv':
            if self.data:
                keys = self.data[0].keys()
                with open(filename, 'w', newline='') as f:
                    writer = csv.DictWriter(f, fieldnames=keys)
                    writer.writeheader()
                    writer.writerows(self.data)

# Usage (assumes `page` and `product_urls` from earlier)
pipeline = DataPipeline(output_format='json')
schema = {
    'title': '.product-title',
    'price': '.product-price',
    'description': '.product-description',
    'rating': '.rating',
}
for url in product_urls:
    page.get(url)
    pipeline.extract_data(page, schema)
pipeline.save('products.json')
```
When to use DrissionPage vs alternatives
My decision framework.
| Use Case | Best Tool | Why |
|---|---|---|
| Simple static sites | requests + BeautifulSoup | Fast, lightweight, no browser overhead |
| JavaScript-heavy sites | DrissionPage | Better detection bypass than Selenium |
| Simple automation | Playwright | Better documentation, multi-language |
| Large scale scraping | Scrapy + DrissionPage | Scrapy framework, DrissionPage as downloader (sketch below) |
| Anti-bot protected sites | DrissionPage | Best detection bypass |
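
For the Scrapy pairing, the shape I use is a downloader middleware that renders flagged requests in DrissionPage and hands Scrapy the final HTML. A minimal sketch, assuming Scrapy's standard middleware hooks (the class name and `use_browser` meta flag are mine, not a published middleware):

```python
from DrissionPage import ChromiumPage
from scrapy.http import HtmlResponse

class DrissionDownloaderMiddleware:
    """Render marked requests in a real browser, let Scrapy handle the rest."""
    def __init__(self):
        self.page = ChromiumPage()

    def process_request(self, request, spider):
        # Only render requests the spider flags; everything else passes
        # through Scrapy's normal HTTP downloader
        if not request.meta.get('use_browser'):
            return None
        self.page.get(request.url)
        return HtmlResponse(
            url=self.page.url,
            body=self.page.html,
            encoding='utf-8',
            request=request,
        )
```

Enable it via `DOWNLOADER_MIDDLEWARES` and set `meta={'use_browser': True}` only on the requests that actually need rendering, so plain pages keep Scrapy's speed.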
Common intermediate issues
Problems I hit at this level.
Issue: Memory leaks with long-running scrapers
Fix: Periodically restart browser. Clear cache and cookies. Use browser pool with max size. Monitor memory usage and restart when threshold exceeded.
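
A minimal sketch of the restart-on-threshold idea with psutil; it assumes `page.process_id` exposes the browser PID on your DrissionPage version, so check that before relying on it:

```python
from DrissionPage import ChromiumPage
import psutil

MAX_RSS_MB = 1500  # restart threshold; tune for your box

def check_memory(page):
    """Return a fresh browser if the current one has grown too fat."""
    rss_mb = psutil.Process(page.process_id).memory_info().rss / 1024 / 1024
    if rss_mb > MAX_RSS_MB:
        page.quit()
        return ChromiumPage()
    return page

# Call between batches:
# page = check_memory(page)
```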
Issue: Session expiring mid-scrape
Fix: Implement session refresh logic. Detect auth failures (redirect to login). Store multiple credentials. Auto-relogin when session expires.
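
The relogin wrapper I use looks roughly like this; the `/login` URL check and the selectors are placeholders for whatever your target site actually uses:

```python
def ensure_logged_in(page, creds):
    """Detect an auth redirect and log back in before continuing."""
    if '/login' not in page.url:
        return  # session still valid
    page.ele('#username').input(creds['username'])
    page.ele('#password').input(creds['password'])
    page.ele('#login-btn').click()
    page.wait.doc_loaded()

# Wrap navigation with the check:
# page.get(url)
# ensure_logged_in(page, creds)
```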
Issue: Captcha appearing frequently
Fix: Slow down requests. Rotate IP addresses with proxies. Solve captchas with 2Captcha or DeathByCaptcha. Use headful mode occasionally to see what's happening.
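
Before reaching for a solving service, at least detect the captcha and back off instead of hammering the site. A sketch with placeholder selectors - inspect your target for the real ones:

```python
import random
import time

def captcha_backoff(page, base_pause=60):
    """Pause hard when a captcha appears; retrying immediately makes it worse."""
    # Placeholder selectors - adjust for the captcha your target serves
    if page.ele('css:iframe[src*="captcha"]', timeout=0) or page.ele('.g-recaptcha', timeout=0):
        pause = base_pause * random.uniform(1, 3)
        print(f"Captcha detected, cooling off for {pause:.0f}s")
        time.sleep(pause)
        return True
    return False
```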
Issue: Dynamic content not loading
Fix: Wait for doc_loaded, then poll for the element or element count you need. Monitor XHR requests with the listener. Check for WebSocket data. Some content is lazy loaded - scroll to trigger it, as in the sketch below.
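
A sketch of the scroll-to-trigger approach: scroll in steps until enough items exist. The selector and counts are placeholders:

```python
import time

def load_lazy_items(page, selector='.item', target=50, max_rounds=20):
    """Scroll stepwise until enough lazily loaded items exist."""
    for _ in range(max_rounds):
        if len(page.eles(selector)) >= target:
            break
        page.scroll.down(800)  # scroll by pixels to fire lazy loaders
        time.sleep(1)
    return page.eles(selector)
```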
Issue: Browser crashes with many tabs
Fix: Limit concurrent browsers. Close tabs when done. Use incognito mode for isolation. Restart browser periodically.
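
The one-tab-per-URL pattern that keeps the tab count flat, assuming the 4.x `new_tab()` / `tab.close()` API:

```python
from DrissionPage import ChromiumPage

page = ChromiumPage()
urls = ['https://example.com/a', 'https://example.com/b']

# Open a tab per URL and close it as soon as it's scraped,
# so the browser never accumulates dozens of live tabs
for url in urls:
    tab = page.new_tab(url)
    data = tab.ele('.content').text
    tab.close()
```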
Moving forward
At this level, you can handle most scraping challenges. Session management, dynamic content, detection bypass - these techniques cover 80% of real-world scenarios.
Next step would be advanced topics: distributed scraping with Redis queue, machine learning for CAPTCHA solving, building scrapers as microservices. But that's another article.
Best way to learn: pick a challenging site (e-commerce, social media, news aggregator) and build a production scraper. You'll hit problems not covered here - solving them is how you get to advanced level.
DrissionPage's GitHub repo has good examples. Join their Discord for community help. Read the source code when stuck - it's well-written.
Happy scraping. Stay under the radar.