DrissionPage: What I Learned After Using It for a Year

Started with basic page loads, kept hitting walls. Here's how I handled complex sites, detection, sessions, and actually got stuff working reliably.

How I got here

Used DrissionPage on and off for maybe a year. Basic stuff worked fine - navigate to page, click some buttons, extract text. Thought I had it figured out.

Then tried scraping a site with heavy JavaScript and Cloudflare. Got blocked constantly. Sessions kept expiring. Dynamic content wouldn't load. Realized there's a whole other level to this.

Spent way too much time figuring this stuff out through trial and error. Documentation exists but it's scattered. Eventually cobbled together solutions that actually work in production.

Not claiming this is the "right" way or "best practices". Just what worked for me after lots of failed attempts. If you're stuck at the basic level and need to handle real sites, maybe this helps.

Session management

Real sites need login, cookies, headers. Here's how to manage sessions properly.

Persistent sessions

from DrissionPage import ChromiumPage
import pickle
import os

class SessionManager:
    def __init__(self, session_file='session.pkl'):
        self.session_file = session_file
        self.cookies = {}
        self.headers = {}

    def save_session(self, page):
        """Save cookies and headers to file"""
        self.cookies = page.cookies(as_dict=True)
        self.headers = {
            'User-Agent': page.user_agent,
            'Referer': page.url,
        }

        with open(self.session_file, 'wb') as f:
            pickle.dump({
                'cookies': self.cookies,
                'headers': self.headers,
            }, f)

    def load_session(self, page):
        """Load saved session into browser"""
        if not os.path.exists(self.session_file):
            return False

        with open(self.session_file, 'rb') as f:
            data = pickle.load(f)

        # Restore cookies
        for name, value in data['cookies'].items():
            page.set.cookies(name, value)

        return True

# Usage
page = ChromiumPage()
manager = SessionManager()

# Try to load existing session
if not manager.load_session(page):
    # No session found, need to login
    page.get('https://example.com/login')
    page.ele('#username').input('myusername')
    page.ele('#password').input('mypassword')
    page.ele('#login-btn').click()

    # Wait for the post-login page to finish loading
    page.wait.doc_loaded()

    # Save session for next time
    manager.save_session(page)

Custom headers and auth

from DrissionPage import ChromiumPage

# Set custom headers before navigation
page = ChromiumPage()

# Add authorization headers
page.set.headers({
    'Authorization': 'Bearer your-token-here',
    'X-API-Key': 'your-api-key',
    'Accept': 'application/json',
})

# Set user agent to avoid detection
page.set.user_agent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/120.0.0.0 Safari/537.36'
)

# Navigate - the headers set above are sent with the request
page.get('https://api.example.com/protected')

Handle multiple sessions

from DrissionPage import ChromiumPage
import threading

class MultiSessionScraper:
    def __init__(self, num_sessions=3):
        self.sessions = []
        self.num_sessions = num_sessions

    def create_session(self, user_creds):
        """Create browser instance with login"""
        page = ChromiumPage()

        # Login with credentials
        page.get('https://example.com/login')
        page.ele('#username').input(user_creds['username'])
        page.ele('#password').input(user_creds['password'])
        page.ele('#login-btn').click()
        page.wait.load_start()

        return page

    def init_sessions(self, credentials_list):
        """Initialize multiple sessions"""
        threads = []
        for creds in credentials_list[:self.num_sessions]:
            thread = threading.Thread(
                # Bind creds now - a bare lambda would see the loop variable's last value
                target=lambda c=creds: self.sessions.append(
                    self.create_session(c)
                )
            )
            threads.append(thread)
            thread.start()

        # Wait for all sessions to initialize
        for thread in threads:
            thread.join()

    def scrape_with_rotation(self, urls):
        """Rotate through sessions to avoid rate limiting"""
        results = []
        for i, url in enumerate(urls):
            # Use sessions in round-robin fashion (handles failed logins too)
            session = self.sessions[i % len(self.sessions)]
            session.get(url)
            data = session.ele('.content').text
            results.append(data)

        return results
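
Minimal usage sketch for the class above - the accounts and URLs are placeholders, not a real site:

# Placeholder credentials - swap in real accounts
credentials = [
    {'username': 'user1', 'password': 'pass1'},
    {'username': 'user2', 'password': 'pass2'},
    {'username': 'user3', 'password': 'pass3'},
]

scraper = MultiSessionScraper(num_sessions=3)
scraper.init_sessions(credentials)

urls = [f'https://example.com/data/{i}' for i in range(10)]
results = scraper.scrape_with_rotation(urls)
print(f"Collected {len(results)} pages")

# Close every browser instance when done
for session in scraper.sessions:
    session.quit()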

Handling dynamic content

Modern sites load content asynchronously. Here's how to handle it.

Wait strategies

Don't just use sleep(). Use smart waiting.

from DrissionPage import ChromiumPage
import time

page = ChromiumPage()

# Strategy 1: Wait for specific element
page.get('https://example.com/dynamic')
element = page.ele('#dynamic-content', timeout=10)
# Waits up to 10 seconds for element to appear

# Strategy 2: Wait for page state
page.wait.load_start()  # Wait for page to start loading
page.wait.doc_loaded()  # Wait for document to complete
page.wait.network_idle()  # Wait for network to be idle

# Strategy 3: Wait for a custom condition by polling
def content_loaded(page):
    # ele() with timeout=0 checks once without waiting
    return bool(page.ele('.data-table', timeout=0))

deadline = time.time() + 5
while time.time() < deadline and not content_loaded(page):
    time.sleep(0.5)  # Check every 0.5s for up to 5s

# Strategy 4: Wait for URL change
page.get('https://example.com/redirect')
page.wait.url_change('https://example.com/target')

# Strategy 5: Wait until enough items are present
def items_loaded(page):
    return len(page.eles('.item')) > 10

deadline = time.time() + 10
while time.time() < deadline and not items_loaded(page):
    time.sleep(0.5)

Infinite scroll handling

Load all items from infinite scroll pages.

def scrape_infinite_scroll(page, max_scrolls=50):
    """Scrape all items from infinite scroll page"""
    items = set()  # Use set to avoid duplicates
    scroll_count = 0
    previous_height = 0

    while scroll_count < max_scrolls:
        # Collect current items
        for item in page.eles('.product-card'):
            item_id = item.attr('data-id')
            if item_id and item_id not in items:
                items.add(item_id)
                # Process item here
                title = item.ele('.title').text
                price = item.ele('.price').text
                print(f"Scraped: {title} - {price}")

        # Scroll to bottom
        page.scroll.to_bottom()

        # Wait for new content to load
        page.wait(2)

        # Check if page height changed (new content loaded)
        current_height = page.run_js('return document.body.scrollHeight;')
        if current_height == previous_height:
            # No new content, reached end
            break

        previous_height = current_height
        scroll_count += 1

    return len(items)

# Usage
page.get('https://example.com/products')
total_items = scrape_infinite_scroll(page)
print(f"Total items scraped: {total_items}")

WebSocket interception

Capture real-time data from WebSocket connections.

from DrissionPage import ChromiumPage
import json

page = ChromiumPage()

# Start listening before navigating so the initial WebSocket traffic is captured
page.listen.start('websocket')

# Navigate to page with WebSocket
page.get('https://example.com/live-data')

# Wait for data to arrive
page.wait(10)

# Collect captured WebSocket messages
ws_messages = []
messages = page.listen.wait(extra_cond=lambda m: m.type == 'websocket')

for msg in messages:
    if msg.ws_data:
        data = json.loads(msg.ws_data)
        ws_messages.append(data)
        print(f"Received: {data}")

# Stop listening
page.listen.stop()

Bypassing detection

Sites fight bots. Here's how to stay under the radar.

Browser fingerprinting

Hide automation characteristics.

from DrissionPage import ChromiumPage, ChromiumOptions
import random
import time

# Create page with anti-detection settings
co = ChromiumOptions()
for arg in (
    '--disable-blink-features=AutomationControlled',
    '--disable-dev-shm-usage',
    '--no-sandbox',
    '--disable-gpu',
):
    co.set_argument(arg)

page = ChromiumPage(co)

# Remove webdriver traces
page.run_cdp(
    'Page.addScriptToEvaluateOnNewDocument',
    source='''
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });

        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5]
        });

        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });

        window.chrome = {
            runtime: {}
        };
    '''
)

# Randomize behavior
def random_sleep(min_sec=1, max_sec=3):
    """Sleep for random duration"""
    time.sleep(random.uniform(min_sec, max_sec))

# Mimic human typing
def human_type(element, text):
    """Type text character by character with human-like pauses"""
    for char in text:
        element.input(char, clear=False)
        time.sleep(random.uniform(0.05, 0.2))

# Use it
page.get('https://example.com')
search_box = page.ele('#search')
human_type(search_box, 'search query')
random_sleep(0.5, 1.5)
search_box.input('\n')  # '\n' presses Enter to submit

Cloudflare bypass

Handle Cloudflare challenges gracefully.

from DrissionPage import ChromiumPage, ChromiumOptions
import time

def bypass_cloudflare(page, url, max_wait=30):
    """
    Wait out a Cloudflare challenge page
    """
    page.get(url)

    start_time = time.time()

    while time.time() - start_time < max_wait:
        # The interstitial titles itself "Just a moment..." and often
        # keeps "challenge" in the URL until it clears
        title = (page.title or '').lower()
        if 'just a moment' in title or 'challenge' in page.url:
            print("Waiting for Cloudflare challenge...")
            time.sleep(2)
            continue

        # Neither indicator present - we're through
        print("Cloudflare bypassed!")
        return True

    print("Failed to bypass Cloudflare")
    return False

# Enhanced approach with browser profile
def create_stealth_browser():
    """Create browser that looks more human"""
    co = ChromiumOptions()
    co.set_user_data_path('./browser_profile')  # Persist profile between runs
    co.set_argument('--disable-blink-features=AutomationControlled')
    co.set_argument('--exclude-switches=enable-automation')
    co.set_argument('--disable-infobars')
    page = ChromiumPage(co)

    # Set a realistic viewport
    page.set.window.size(1920, 1080)

    return page

# Usage
page = create_stealth_browser()
success = bypass_cloudflare(page, 'https://protected-site.com')

if success:
    # Proceed with scraping
    data = page.ele('.content').text
else:
    print("Could not access site")

Rate limiting and delays

Implement smart delays to avoid bans.

import time
import random
from DrissionPage import ChromiumPage

class SmartDelayer:
    def __init__(self, min_delay=1, max_delay=3):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_action = time.time()

    def wait(self):
        """Wait with random duration"""
        # Calculate time since last action
        elapsed = time.time() - self.last_action

        # Add random delay
        delay = random.uniform(self.min_delay, self.max_delay)
        if elapsed < delay:
            time.sleep(delay - elapsed)

        self.last_action = time.time()

    def human_action(self, action_func, *args, **kwargs):
        """Execute action with human-like delay"""
        self.wait()
        result = action_func(*args, **kwargs)
        return result

# Usage
page = ChromiumPage()
delayer = SmartDelayer(min_delay=2, max_delay=5)

urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']

for url in urls:
    delayer.human_action(page.get, url)

    # Process page
    data = page.ele('.content').text

    # Random longer pause between pages
    time.sleep(random.uniform(10, 20))

Concurrent scraping

Speed up scraping with multiple browsers.

Thread-based parallel scraping

from DrissionPage import ChromiumPage
import threading
import queue
import time

class WorkerThread(threading.Thread):
    def __init__(self, task_queue, result_queue):
        super().__init__()
        self.task_queue = task_queue
        self.result_queue = result_queue
        self.page = None

    def run(self):
        """Worker thread that processes tasks"""
        # Create browser instance for this thread
        self.page = ChromiumPage()

        while True:
            try:
                # Get task from queue
                task = self.task_queue.get(timeout=5)

                if task is None:  # Poison pill
                    break

                # Process task
                url, action = task
                try:
                    result = self.process_task(url, action)
                    self.result_queue.put(result)
                except Exception as e:
                    self.result_queue.put({'error': str(e), 'url': url})

                self.task_queue.task_done()

            except queue.Empty:
                continue

        # Cleanup
        if self.page:
            self.page.quit()

    def process_task(self, url, action):
        """Process single scraping task"""
        self.page.get(url)
        self.page.wait.load_start()

        if action == 'scrape':
            return {
                'url': url,
                'data': self.page.ele('.content').text,
            }
        elif action == 'screenshot':
            return {
                'url': url,
                'screenshot': self.page.get_screenshot(),
            }

def parallel_scrape(urls, num_workers=3):
    """Scrape multiple URLs in parallel"""
    # Create queues
    task_queue = queue.Queue()
    result_queue = queue.Queue()

    # Add tasks to queue
    for url in urls:
        task_queue.put((url, 'scrape'))

    # Add poison pills
    for _ in range(num_workers):
        task_queue.put(None)

    # Create and start workers
    workers = []
    for _ in range(num_workers):
        worker = WorkerThread(task_queue, result_queue)
        worker.start()
        workers.append(worker)

    # Wait for completion
    for worker in workers:
        worker.join()

    # Collect results
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())

    return results

# Usage
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    'https://example.com/page4',
    'https://example.com/page5',
]

results = parallel_scrape(urls, num_workers=3)
print(f"Scraped {len(results)} pages")

Resource management

class BrowserPool:
    """Pool of browser instances for efficient resource usage"""
    def __init__(self, max_size=5):
        self.pool = []
        self.max_size = max_size
        self.lock = threading.Lock()

    def acquire(self):
        """Get browser from pool"""
        with self.lock:
            if self.pool:
                return self.pool.pop()
            else:
                # Create new browser
                return ChromiumPage()

    def release(self, browser):
        """Return browser to pool"""
        with self.lock:
            if len(self.pool) < self.max_size:
                # Clear cookies so the next user starts clean
                browser.set.cookies.clear()
                self.pool.append(browser)
            else:
                # Pool full, close browser
                browser.quit()

    def cleanup(self):
        """Close all browsers in pool"""
        with self.lock:
            for browser in self.pool:
                browser.quit()
            self.pool.clear()

# Usage
pool = BrowserPool(max_size=3)

browser = pool.acquire()
try:
    browser.get('https://example.com')
    # Do work
    data = browser.ele('.content').text
finally:
    # Always return the browser, even if scraping raised
    pool.release(browser)

pool.cleanup()

Production-ready error handling

Real scrapers fail. Handle failures gracefully.

Retry with exponential backoff

import time
from functools import wraps

def retry(max_attempts=3, base_delay=1, max_delay=60):
    """Decorator for retry with exponential backoff"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            attempt = 0
            while attempt < max_attempts:
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    attempt += 1
                    if attempt >= max_attempts:
                        raise

                    # Exponential backoff
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    print(f"Attempt {attempt} failed, retrying in {delay}s...")
                    time.sleep(delay)

            return None
        return wrapper
    return decorator

# Usage
@retry(max_attempts=3, base_delay=2)
def scrape_page(page, url):
    page.get(url)
    page.wait.load_start()
    return page.ele('.content').text

# Try scraping
try:
    data = scrape_page(page, 'https://example.com')
except Exception as e:
    print(f"Failed after retries: {e}")

Comprehensive error handling

from DrissionPage import ChromiumPage
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class RobustScraper:
    def __init__(self):
        self.page = None
        self.max_retries = 3

    def init_browser(self):
        """Initialize browser with error handling"""
        try:
            self.page = ChromiumPage()
            logger.info("Browser initialized successfully")
            return True
        except Exception as e:
            logger.error(f"Failed to initialize browser: {e}")
            return False

    def safe_scrape(self, url, selector):
        """Scrape with comprehensive error handling"""
        if not self.page:
            if not self.init_browser():
                return None

        for attempt in range(self.max_retries):
            try:
                # Navigate with timeout
                self.page.get(url, timeout=30)

                # Wait for the page to finish loading
                self.page.wait.doc_loaded()

                # Find element
                element = self.page.ele(selector, timeout=10)

                if not element:
                    logger.warning(f"Element not found: {selector}")
                    return None

                # Extract data
                return element.text

            except TimeoutError:
                logger.warning(f"Timeout on attempt {attempt + 1}")
                if attempt < self.max_retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff

            except Exception as e:
                logger.error(f"Error on attempt {attempt + 1}: {e}")
                if attempt == self.max_retries - 1:
                    # Last attempt failed, try restarting browser
                    self.restart_browser()

        return None

    def restart_browser(self):
        """Restart browser to recover from errors"""
        logger.info("Restarting browser...")
        try:
            if self.page:
                self.page.quit()
            self.init_browser()
        except Exception as e:
            logger.error(f"Failed to restart browser: {e}")

    def cleanup(self):
        """Cleanup resources"""
        if self.page:
            self.page.quit()

# Usage
scraper = RobustScraper()
try:
    data = scraper.safe_scrape(
        'https://example.com',
        '.main-content'
    )
    if data:
        print(f"Scraped: {data}")
finally:
    scraper.cleanup()

Production patterns

Real project patterns I use.

Configuration management

import yaml
import os

class Config:
    """Centralized configuration management"""
    def __init__(self, config_file='config.yaml'):
        with open(config_file) as f:
            self.config = yaml.safe_load(f)

    def get(self, key, default=None):
        """Get configuration value"""
        keys = key.split('.')
        value = self.config
        for k in keys:
            value = value.get(k)
            if value is None:
                return default
        return value

# config.yaml
"""
scraping:
  concurrent_browsers: 3
  page_timeout: 30
  retry_attempts: 3
  delay_between_requests: 2

proxies:
  - type: http
    host: proxy1.example.com
    port: 8080
  - type: socks5
    host: proxy2.example.com
    port: 1080

logging:
  level: INFO
  file: scraper.log
"""

# Usage
config = Config()
timeout = config.get('scraping.page_timeout', 30)
proxies = config.get('proxies', [])

Data pipeline integration

import csv
import json
from datetime import datetime

class DataPipeline:
    """Handle data extraction and storage"""
    def __init__(self, output_format='json'):
        self.output_format = output_format
        self.data = []

    def extract_data(self, page, schema):
        """Extract structured data based on schema"""
        record = {}
        for field, selector in schema.items():
            element = page.ele(selector)
            record[field] = element.text if element else None

        record['scraped_at'] = datetime.now().isoformat()
        self.data.append(record)
        return record

    def save(self, filename):
        """Save data to file"""
        if self.output_format == 'json':
            with open(filename, 'w') as f:
                json.dump(self.data, f, indent=2)

        elif self.output_format == 'csv':
            if self.data:
                keys = self.data[0].keys()
                with open(filename, 'w', newline='') as f:
                    writer = csv.DictWriter(f, fieldnames=keys)
                    writer.writeheader()
                    writer.writerows(self.data)

# Usage
pipeline = DataPipeline(output_format='json')

schema = {
    'title': '.product-title',
    'price': '.product-price',
    'description': '.product-description',
    'rating': '.rating'
}

product_urls = ['https://example.com/product/1', 'https://example.com/product/2']

for url in product_urls:
    page.get(url)
    page.wait.load_start()
    pipeline.extract_data(page, schema)

pipeline.save('products.json')

When to use DrissionPage vs alternatives

My decision framework.

Simple static sites: requests + BeautifulSoup - fast, lightweight, no browser overhead
JavaScript-heavy sites: DrissionPage - better detection bypass than Selenium
Simple automation: Playwright - better documentation, multi-language support
Large-scale scraping: Scrapy + DrissionPage - Scrapy for the framework, DrissionPage as the downloader
Anti-bot protected sites: DrissionPage - best detection bypass

Common intermediate issues

Problems I hit at this level.

Issue: Memory leaks with long-running scrapers

Fix: Periodically restart the browser. Clear cache and cookies. Use a browser pool with a max size. Monitor memory usage and restart when a threshold is exceeded.
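
A sketch of that, assuming psutil is installed (it's not a DrissionPage dependency) and that the browser was launched by this process so it appears as a child; the thresholds and URLs are made up, tune them:

import psutil
from DrissionPage import ChromiumPage

MAX_RSS_MB = 1500        # assumed threshold, adjust for your machine
PAGES_PER_RESTART = 200  # hard restart interval regardless of memory

def total_memory_mb():
    """RSS of this process plus its children (the launched browser), in MB."""
    proc = psutil.Process()
    rss = proc.memory_info().rss
    for child in proc.children(recursive=True):
        try:
            rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # child exited between listing and reading
    return rss / (1024 * 1024)

urls = [f'https://example.com/page/{i}' for i in range(1000)]  # placeholder

page = ChromiumPage()
for i, url in enumerate(urls, start=1):
    page.get(url)
    # ... extract data here ...

    if i % PAGES_PER_RESTART == 0 or total_memory_mb() > MAX_RSS_MB:
        page.quit()            # drop the old browser and its memory
        page = ChromiumPage()  # fresh instance; reload cookies if you need the session
page.quit()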

Issue: Session expiring mid-scrape

Fix: Implement session refresh logic. Detect auth failures (redirect to login). Store multiple credentials. Auto-relogin when session expires.
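
One way I'd wire that up is to check for the login redirect after each navigation and re-authenticate on the spot; a sketch, with placeholder login URL and selectors:

def logged_in(page):
    """Heuristic: the site bounces expired sessions to the login page."""
    return '/login' not in page.url

def relogin(page, creds):
    """Re-authenticate in the current browser (selectors are placeholders)."""
    page.get('https://example.com/login')
    page.ele('#username').input(creds['username'])
    page.ele('#password').input(creds['password'])
    page.ele('#login-btn').click()
    page.wait.doc_loaded()

def scrape_protected(page, url, creds):
    page.get(url)
    if not logged_in(page):   # session expired mid-scrape
        relogin(page, creds)
        page.get(url)         # retry the target page with the fresh session
    return page.ele('.content').text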

Issue: Captcha appearing frequently

Fix: Slow down requests. Rotate IP addresses with proxies. Solve captchas with 2Captcha or DeathByCaptcha. Use headful mode occasionally to see what's happening.
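
For the proxy part, passing Chrome's --proxy-server flag through ChromiumOptions is enough for simple HTTP proxies; the endpoints and URLs below are placeholders, and captcha-solver integration isn't shown:

import random
import time
from DrissionPage import ChromiumPage, ChromiumOptions

# Placeholder proxy pool - use your own endpoints
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]

def browser_with_proxy(proxy):
    """Launch a browser that routes traffic through the given proxy."""
    co = ChromiumOptions()
    co.set_argument(f'--proxy-server={proxy}')
    return ChromiumPage(co)

urls = ['https://example.com/a', 'https://example.com/b']  # placeholder

page = browser_with_proxy(random.choice(PROXIES))
for url in urls:
    page.get(url)
    # ... extract data here ...
    time.sleep(random.uniform(5, 15))  # slower pace means fewer captchas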

Issue: Dynamic content not loading

Fix: Use a network_idle wait. Monitor XHR requests. Wait for a specific element count. Check for WebSocket data. Some content is lazy-loaded - scroll to trigger it.
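
For lazy-loaded content specifically, scrolling in steps and stopping once the element count stops growing usually works; a sketch (the selector is a placeholder):

def trigger_lazy_load(page, selector='.item', step=800, max_steps=30):
    """Scroll down in increments until no new elements appear."""
    seen = 0
    for _ in range(max_steps):
        page.run_js(f'window.scrollBy(0, {step});')  # scroll a screenful at a time
        page.wait(1)                                 # give XHR/lazy images time to fire
        count = len(page.eles(selector))
        if count == seen:                            # nothing new loaded, stop
            break
        seen = count
    return seen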

Issue: Browser crashes with many tabs

Fix: Limit concurrent browsers. Close tabs when done. Use incognito mode for isolation. Restart browser periodically.
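
To cap how many browsers run at once, a plain semaphore around browser creation does the job; a sketch, assuming ChromiumOptions.auto_port() is available so each thread gets its own browser instance (URLs are placeholders):

import threading
from DrissionPage import ChromiumPage, ChromiumOptions

MAX_BROWSERS = 3
browser_slots = threading.Semaphore(MAX_BROWSERS)

def scrape_one(url, results):
    """Hold a slot for the browser's lifetime and always release it."""
    with browser_slots:  # blocks while MAX_BROWSERS are already open
        co = ChromiumOptions()
        co.auto_port()   # separate port/profile so instances don't collide
        page = ChromiumPage(co)
        try:
            page.get(url)
            results.append(page.ele('.content').text)
        finally:
            page.quit()  # close the browser and free its memory

urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
results = []
threads = [threading.Thread(target=scrape_one, args=(u, results)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Scraped {len(results)} pages")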

Moving forward

At this level, you can handle most scraping challenges. Session management, dynamic content, detection bypass - these techniques cover 80% of real-world scenarios.

Next step would be advanced topics: distributed scraping with Redis queue, machine learning for CAPTCHA solving, building scrapers as microservices. But that's another article.

Best way to learn: pick a challenging site (e-commerce, social media, news aggregator) and build a production scraper. You'll hit problems not covered here - solving them is how you get to advanced level.

DrissionPage's GitHub repo has good examples. Join their Discord for community help. Read the source code when stuck - it's well-written.

Happy scraping. Stay under the radar.