Apify SDK: Finally Got Scraping to Scale

My Selenium setup worked for 10K pages but fell apart at 100K. Memory leaks, blocked IPs, and single-machine bottlenecks killed it. Apify SDK handled the migration and scaling with minimal code changes.

Why I Migrated to Apify

When scraping at scale, Selenium/BeautifulSoup setups hit hard limits: memory leaks in long-running browser sessions, IPs getting blocked with no built-in rotation, hand-rolled queues that don't survive restarts, and single-machine bottlenecks.

Apify SDK solves these with built-in queues, distributed execution, proxy management, and storage.

Problem

I had 20+ Selenium scrapers running in production. Rewriting them completely for Apify would take weeks. I needed a migration path that didn't require rewriting all scraping logic.

What I Tried

Attempt 1: Rewrote scrapers from scratch using Apify Cheerio - Took too long, lost functionality
Attempt 2: Used Apify SDK just for queue/storage - Missed out on proxy/autoscaling benefits
Attempt 3: Mixed approach with Puppeteer - Complex architecture, hard to maintain

Actual Fix

The Apify SDK supports Playwright out of the box (through its Crawlee crawler classes), and Playwright's browser API maps closely to Selenium's. The migration was mostly about wrapping existing logic in Apify's structure:

import asyncio

from apify import Actor
# PlaywrightCrawler and its context ship with Crawlee, which the Apify SDK builds on
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from playwright.async_api import Page

# Before: Selenium scraper
def selenium_scraper(url: str):
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get(url)

    # Scraping logic
    title = driver.find_element(By.TAG_NAME, 'h1').text
    price = driver.find_element(By.CSS_SELECTOR, '.price').text

    driver.quit()

    return {'title': title, 'price': price}

# After: Apify with Playwright (minimal changes)
async def apify_scraper(context: PlaywrightCrawlingContext):
    """
    Apify scraper using Playwright
    Similar to Selenium but with built-in scaling
    """
    page: Page = context.page

    # Same scraping logic, different API
    title = await page.locator('h1').text_content()
    price = await page.locator('.price').text_content()

    # Push to dataset (automatic storage)
    await context.push_data({
        'url': context.request.url,
        'title': title,
        'price': price
    })

# Main entry point
async def main():
    async with Actor:
        # Configure the crawler (options map closely to Selenium's)
        crawler = PlaywrightCrawler(
            headless=True,
            browser_type='chromium'
        )

        # Register the request handler
        crawler.router.default_handler(apify_scraper)

        # Run the crawler; the URLs go straight into the managed queue
        await crawler.run([
            'https://example.com/product/1',
            'https://example.com/product/2',
            'https://example.com/product/3'
        ])

# Run
if __name__ == '__main__':
    asyncio.run(main())

Problem

When my crawler crashed or was restarted, it would lose track of which URLs were processed. The in-memory queue didn't persist, causing duplicate work and missing pages.

What I Tried

Attempt 1: Stored processed URLs in database - Worked but added complexity
Attempt 2: Used Redis queue - Better but required separate infrastructure
Attempt 3: Checkpoint to file - Messy with concurrent crawlers

Actual Fix

Apify's RequestQueue persists automatically. It survives restarts and supports distributed crawling:

from apify import Actor, Request

async def main():
    async with Actor:
        # Initialize request queue (auto-persists)
        queue = await Actor.open_request_queue()

        # Add requests (only adds if not already processed)
        await queue.add_request(
            Request.from_url('https://example.com/page1')
        )
        await queue.add_request(
            Request.from_url('https://example.com/page2')
        )

        # Process queue
        while not await queue.is_finished():
            request = await queue.fetch_next_request()

            if request:
                try:
                    # Process the request
                    data = await scrape_page(request.url)

                    # Save to dataset
                    await Actor.push_data(data)

                    # Mark as handled (persists to storage)
                    await queue.mark_request_as_handled(request)

                except Exception:
                    # Return the request to the queue; it will be retried
                    await queue.reclaim_request(request)

        # Queue state persists - restart and it continues from where it left off

# With automatic checkpointing
async def crawler_with_checkpoint(context):
    """
    Apify automatically checkpoints queue state
    Restart crawler and it continues from last position
    """
    # Process page
    data = await scrape_page(context.request.url)

    # Save data
    await context.push_data(data)

    # Queue automatically persisted after each batch
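The dedup-and-persist contract that makes this work is simple to state: add_request derives a unique key from the URL, drops duplicates, and tracks handled requests so a restart skips them. A toy pure-Python model of that contract (class and method names here are illustrative, not the SDK's):

```python
class MiniRequestQueue:
    """Toy model of RequestQueue's dedup-and-persist contract."""

    def __init__(self):
        self.seen = set()       # unique keys ever added
        self.pending = []       # fetched in FIFO order
        self.handled = set()    # finished requests

    def add_request(self, url, unique_key=None):
        key = unique_key or url            # default unique key is the URL
        if key in self.seen:
            return False                   # duplicate: silently ignored
        self.seen.add(key)
        self.pending.append((key, url))
        return True

    def fetch_next_request(self):
        return self.pending.pop(0) if self.pending else None

    def mark_handled(self, key):
        self.handled.add(key)              # a restart would skip these

    def is_finished(self):
        return not self.pending

q = MiniRequestQueue()
assert q.add_request('https://example.com/page1') is True
assert q.add_request('https://example.com/page1') is False  # deduplicated
```

The real queue persists `seen` and `handled` to storage after each operation, which is exactly the state my database and file-checkpoint attempts were maintaining by hand.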

Problem

When scraping at scale, individual proxies would get blocked. I needed automatic rotation with smart fallback, but Apify's proxy configuration was confusing.

What I Tried

Attempt 1: Used Apify Proxy without configuration - Blocked quickly
Attempt 2: Bought cheap proxy list and rotated manually - Most proxies already burned
Attempt 3: Used single datacenter proxy - Got blocked after 1K requests

Actual Fix

Configure Apify Proxy with proper session management and smart rotation:

from apify import Actor
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

async def main():
    async with Actor:
        # Custom proxy list; rotation through the list is automatic
        proxy_configuration = ProxyConfiguration(
            proxy_urls=[
                'http://proxy1.example.com:8000',
                'http://proxy2.example.com:8000',
                'http://proxy3.example.com:8000'
            ]
        )
        # Or use Apify's residential proxies on the platform:
        # proxy_configuration = await Actor.create_proxy_configuration(
        #     groups=['RESIDENTIAL'],
        #     country_code='US'
        # )

        # Initialize crawler with proxy and session management
        # (key for rotation: each session is pinned to one proxy and is
        # retired after too many errors or too many uses)
        crawler = PlaywrightCrawler(
            proxy_configuration=proxy_configuration,
            use_session_pool=True
        )

        crawler.router.default_handler(scrape_with_rotation)

        # Run crawler
        await crawler.run(['https://example.com/product/1'])

async def scrape_with_rotation(context: PlaywrightCrawlingContext):
    """
    Each session gets its own proxy
    Sessions are automatically rotated
    """
    page = context.page

    # Scrape with current session/proxy
    data = await scrape_page(page)

    # If this session has too many errors,
    # Apify automatically abandons it and starts a new one

    await context.push_data(data)

# Alternative: fully custom proxy selection
async def custom_proxy_rotation():
    """Pick the proxy yourself, e.g. based on success/error stats"""
    from itertools import cycle

    from crawlee.playwright_crawler import PlaywrightCrawler
    from crawlee.proxy_configuration import ProxyConfiguration

    proxies = cycle([
        'http://proxy1.com:8000',
        'http://proxy2.com:8000',
        'http://proxy3.com:8000'
    ])

    # new_url_function is consulted per request; return any proxy URL
    proxy_configuration = ProxyConfiguration(
        new_url_function=lambda session_id=None, request=None: next(proxies)
    )

    async with Actor:
        crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)
        crawler.router.default_handler(scrape_with_rotation)
        await crawler.run(['https://example.com'])
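Worth understanding what the session pool is actually doing: a session is reused until it hits a usage cap or accumulates too many errors, then it is retired and a fresh one, bound to a fresh proxy, takes its place. A pure-Python sketch of that bookkeeping (names and thresholds are illustrative, not Crawlee's internals):

```python
import random

class Session:
    """One identity: cookies plus a proxy, reused a limited number of times."""

    def __init__(self, sid, max_usage=10, max_errors=3):
        self.sid = sid
        self.usage = 0
        self.errors = 0
        self.max_usage = max_usage
        self.max_errors = max_errors

    def usable(self):
        return self.usage < self.max_usage and self.errors < self.max_errors

class SessionPool:
    """Hand out sessions; retire ones that are worn out or error-prone."""

    def __init__(self, max_size=100):
        self.sessions = {}
        self.max_size = max_size
        self.counter = 0

    def get(self):
        # Drop retired sessions first
        self.sessions = {k: s for k, s in self.sessions.items() if s.usable()}
        if len(self.sessions) < self.max_size:
            # Room in the pool: mint a fresh session (fresh proxy binding)
            self.counter += 1
            session = Session(f's{self.counter}')
            self.sessions[session.sid] = session
        else:
            # Pool is full: reuse a random live session
            session = random.choice(list(self.sessions.values()))
        session.usage += 1
        return session

pool = SessionPool(max_size=2)
s1 = pool.get()
s1.errors = 3        # simulate a blocked proxy
s2 = pool.get()      # s1 is retired; a fresh session takes its place
assert s2.sid != s1.sid
```

This is why a blocked proxy stops hurting after a few failures: its session crosses the error threshold and simply never gets handed out again.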

What I Learned

Apify vs Selenium: Feature Comparison

| Feature | Selenium | Apify SDK |
| --- | --- | --- |
| Request queue | Manual (Redis, DB) | Built-in, persistent |
| Distributed crawling | Manual (Redis, Celery) | Built-in |
| Proxy rotation | Manual | Built-in with smart rotation |
| Storage | Manual (files, DB) | Built-in Dataset, KeyValueStore |
| Retry logic | Manual | Built-in with backoff |
| Checkpointing | Manual | Automatic |
| Autoscaling | Manual (K8s, etc.) | Built-in on Apify Platform |
| Deployment | Manual (Docker, etc.) | One command to Apify Platform |
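The built-in retry with backoff is the row that removed the most hand-written code for me. The usual schedule is exponential with jitter; a pure-Python sketch (illustrative, not the SDK's exact timings):

```python
import random

def backoff_delays(max_retries=3, base=1.0, factor=2.0, jitter=0.1):
    """Seconds to wait before each retry: base * factor**attempt plus jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = base * (factor ** attempt)
        delay += random.uniform(0, jitter * delay)   # spread retries out
        delays.append(delay)
    return delays

# Roughly 1s, 2s, 4s, plus up to 10% jitter each
print(backoff_delays())
```

The jitter matters at scale: without it, a batch of requests that failed together retries together and hammers the target in sync.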

Production Setup That Works

# apify_scraper.py - Production configuration

from apify import Actor, Request
from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductScraper:
    """
    Production scraper with Apify SDK

    Features:
    - Automatic queue management
    - Proxy rotation
    - Error handling with retries
    - Automatic storage
    """

    def __init__(self):
        self.crawler = None

    async def run(self):
        """Main entry point"""
        async with Actor:
            # Proxy configuration (Apify residential proxies)
            proxy_configuration = await Actor.create_proxy_configuration(
                groups=['RESIDENTIAL'],
                country_code='US'
            )

            # Configure crawler
            self.crawler = PlaywrightCrawler(
                # Browser configuration
                headless=True,
                browser_type='chromium',
                max_requests_per_crawl=1000,
                # Concurrency
                concurrency_settings=ConcurrencySettings(max_concurrency=10),
                # Proxy + session management (each session is pinned to a
                # proxy and retired automatically after repeated errors)
                proxy_configuration=proxy_configuration,
                use_session_pool=True,
                # Retry configuration
                retry_on_blocked=True,
                max_request_retries=3
            )

            # Register the page handler
            self.crawler.router.default_handler(self.handle_page)

            # Run crawler over the initial URLs
            await self.crawler.run([
                'https://example.com/products?page=1',
                'https://example.com/products?page=2',
                'https://example.com/products?page=3'
            ])

    async def handle_page(self, context: PlaywrightCrawlingContext):
        """
        Handle each page

        Args:
            context: Playwright crawling context
        """
        page = context.page
        request: Request = context.request

        try:
            logger.info(f"Processing: {request.url}")

            # Wait for content to load
            await page.wait_for_selector('.product-list', timeout=10000)

            # Extract product data
            products = await page.eval_on_selector_all(
                '.product-item',
                '''elements => elements.map(el => ({
                    id: el.getAttribute('data-id'),
                    name: el.querySelector('.name')?.textContent,
                    price: el.querySelector('.price')?.textContent,
                    in_stock: el.querySelector('.stock')?.textContent === 'In Stock'
                }))'''
            )

            # Add metadata
            from datetime import datetime, timezone
            for product in products:
                product['url'] = request.url
                product['scraped_at'] = datetime.now(timezone.utc).isoformat()

            # Save to dataset (automatic storage)
            if products:
                await context.push_data(products)
                logger.info(f"Saved {len(products)} products")

            # Find and enqueue next pages
            next_pages = await page.eval_on_selector_all(
                'a.pagination-link[href*="/products?page="]',
                '''elements => elements.map(el => el.href)'''
            )

            for next_page_url in next_pages:
                await context.add_requests(
                    [Request.from_url(next_page_url)]
                )

        except Exception as e:
            logger.error(f"Error processing {request.url}: {e}")
            # Re-raise so the crawler retries this request automatically;
            # swallowing the exception would mark the page as handled
            raise

# Deployment helper
def deploy_to_apify():
    """
    Deploy scraper to Apify Platform

    Requires:
    - Apify account
    - APIFY_TOKEN environment variable
    """
    import subprocess

    # Login (first time only)
    subprocess.run(['apify', 'login'], check=True)

    # Push to Apify Platform
    subprocess.run(['apify', 'push'], check=True)

    print("Deployed to Apify Platform!")
    print("View at: https://console.apify.com/actors")

# Local development helper
async def run_locally():
    """Run scraper locally for testing"""
    scraper = ProductScraper()
    await scraper.run()

# Main entry point
if __name__ == '__main__':
    import asyncio

    # Run locally
    asyncio.run(run_locally())

    # Or deploy to Apify
    # deploy_to_apify()

Working with Apify Storage

Dataset Operations

from apify import Actor

async def dataset_operations():
    """Work with Apify datasets"""
    async with Actor:
        # Push individual items
        await Actor.push_data({
            'url': 'https://example.com',
            'title': 'Example'
        })

        # Push multiple items
        await Actor.push_data([
            {'id': 1, 'name': 'Item 1'},
            {'id': 2, 'name': 'Item 2'},
            {'id': 3, 'name': 'Item 3'}
        ])

        # Get dataset info
        dataset = await Actor.open_dataset()
        info = await dataset.get_info()
        print(f"Dataset has {info.item_count} items")

        # Export to the default key-value store (XML export isn't built in)
        await dataset.export_to(key='output.csv', content_type='csv')
        await dataset.export_to(key='output.json', content_type='json')

        # Stream items (memory-friendly for large datasets)
        async for item in dataset.iterate_items():
            process_item(item)  # your own processing function
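Since push_data accepts lists, buffering items and pushing them in batches cuts round-trips on the platform. A small helper for that (chunked is my own hypothetical name, not an SDK function):

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

items = [{'id': i} for i in range(10)]
batches = list(chunked(items, 4))
# Each batch would then go to a single Actor.push_data(batch) call
assert [len(b) for b in batches] == [4, 4, 2]
```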

KeyValueStore for Files

async def key_value_store_operations():
    """Store files, screenshots, etc."""
    async with Actor:
        store = await Actor.open_key_value_store()

        # Store screenshots (set_value takes the raw bytes)
        # await page.screenshot(path='screenshot.png')
        # with open('screenshot.png', 'rb') as f:
        #     await store.set_value(
        #         'homepage.png',
        #         f.read(),
        #         content_type='image/png'
        #     )

        # Store JSON config
        await store.set_value('config', {
            'last_run': '2026-03-23',
            'pages_processed': 1000
        })

        # Store HTML
        await store.set_value('page.html', '...')

        # Retrieve values
        config = await store.get_value('config')
        print(f"Config: {config}")

Monitoring & Debugging

Local Testing

# Install Apify CLI
npm install -g apify-cli

# Scaffold a new actor from a template
apify create my-scraper

# Run locally
apify run

# Test with a specific input (actors read it from the default key-value store)
echo '{"urls": ["https://example.com"]}' > storage/key_value_stores/default/INPUT.json
apify run

Common Issues

Related Resources

✓ Migration Checklist: Selenium to Apify