Apify SDK: Finally Got Scraping to Scale
My Selenium setup worked for 10K pages but fell apart at 100K. Memory leaks, blocked IPs, and single-machine bottlenecks killed it. Apify SDK handled the migration and scaling with minimal code changes.
Why I Migrated to Apify
When scraping at scale, Selenium/BeautifulSoup setups hit hard limits:
- Single machine bottleneck: One machine can only handle so many concurrent requests
- State management: No built-in queue, retry logic, or state persistence
- Proxy rotation: Manual proxy management is complex and error-prone
- Memory leaks: Long-running Selenium processes eventually OOM
- Deployment: No easy way to distribute across multiple machines
Apify SDK solves these with built-in queues, distributed execution, proxy management, and storage.
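To make the "complex and error-prone" point concrete: even a minimal hand-rolled rotator has to track per-proxy health, and this toy sketch (my own names, nothing from the SDK) still ignores cooldowns, geo-targeting, and concurrency:

```python
from itertools import cycle

class NaiveProxyRotator:
    """The bare minimum a hand-rolled rotator needs: round-robin
    selection plus retirement of proxies that keep failing."""

    def __init__(self, proxy_urls, max_errors=3):
        self.errors = {url: 0 for url in proxy_urls}
        self.max_errors = max_errors
        self._pool = cycle(proxy_urls)

    def next_proxy(self):
        # Skip proxies that have exhausted their error budget
        for _ in range(len(self.errors)):
            url = next(self._pool)
            if self.errors[url] < self.max_errors:
                return url
        raise RuntimeError('all proxies burned')

    def report_error(self, url):
        # Called whenever a request through this proxy fails
        self.errors[url] += 1

rotator = NaiveProxyRotator(['http://p1:8000', 'http://p2:8000'])
proxy = rotator.next_proxy()            # round-robin pick
rotator.report_error('http://p1:8000')  # record a failed request
```

Multiply this by retry timing, per-domain blocking, and shared state across workers, and "built-in" starts to look cheap.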
Problem
I had 20+ Selenium scrapers running in production. Rewriting them completely for Apify would take weeks. I needed a migration path that didn't require rewriting all scraping logic.
What I Tried
- Attempt 1: Rewrote scrapers from scratch with Apify's Cheerio-based crawler - took too long, lost functionality
- Attempt 2: Used Apify SDK just for queue/storage - missed out on proxy/autoscaling benefits
- Attempt 3: Mixed approach with Puppeteer - complex architecture, hard to maintain
Actual Fix
Apify's Python stack supports Playwright out of the box (via Crawlee, the open-source crawling library the SDK builds on), and Playwright's API maps closely to Selenium's. The migration was mostly about wrapping existing logic in the crawler's structure:

```python
from apify import Actor
# Import path for crawlee >= 0.5; older releases used crawlee.playwright_crawler
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

# Before: Selenium scraper
def selenium_scraper(url: str):
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get(url)
    # Scraping logic
    title = driver.find_element(By.TAG_NAME, 'h1').text
    price = driver.find_element(By.CSS_SELECTOR, '.price').text
    driver.quit()
    return {'title': title, 'price': price}

# After: Playwright-based crawler (minimal changes)
async def apify_scraper(context: PlaywrightCrawlingContext) -> None:
    """Same scraping logic, different API - with built-in scaling."""
    page = context.page
    title = await page.locator('h1').text_content()
    price = await page.locator('.price').text_content()
    # Push to the default dataset (automatic storage)
    await context.push_data({
        'url': context.request.url,
        'title': title,
        'price': price,
    })

# Main entry point
async def main() -> None:
    async with Actor:
        # Browser options map closely to Selenium's
        crawler = PlaywrightCrawler(
            headless=True,
            browser_type='chromium',
        )
        # Register the handler every request goes through
        crawler.router.default_handler(apify_scraper)
        # Seed the queue and run
        await crawler.run([
            'https://example.com/product/1',
            'https://example.com/product/2',
            'https://example.com/product/3',
        ])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
```
Problem
When my crawler crashed or was restarted, it would lose track of which URLs were processed. The in-memory queue didn't persist, causing duplicate work and missing pages.
What I Tried
Attempt 1: Stored processed URLs in database - Worked but added complexity
Attempt 2: Used Redis queue - Better but required separate infrastructure
Attempt 3: Checkpoint to file - Messy with concurrent crawlers
Actual Fix
Apify's RequestQueue persists automatically. It survives restarts and supports distributed crawling:
```python
from apify import Actor
from crawlee import Request

async def main() -> None:
    async with Actor:
        # Open the default request queue (auto-persists to storage)
        queue = await Actor.open_request_queue()

        # Add requests - the queue dedupes by unique key, so a URL that
        # was already enqueued or handled is not added again
        await queue.add_request(Request.from_url('https://example.com/page1'))
        await queue.add_request(Request.from_url('https://example.com/page2'))

        # Process the queue
        while not await queue.is_finished():
            request = await queue.fetch_next_request()
            if request is None:
                continue
            try:
                # scrape_page() stands in for your own scraping logic
                data = await scrape_page(request.url)
                # Save to the default dataset
                await Actor.push_data(data)
                # Mark as handled (persisted, survives restarts)
                await queue.mark_request_as_handled(request)
            except Exception:
                # Return the request to the queue so it gets retried
                await queue.reclaim_request(request)

# Queue state persists: restart the actor and it continues where it left off.
# When you use a crawler class instead of a manual loop like this, the same
# queue is managed for you - each handled request is checkpointed automatically.
```
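For intuition about what the persisted queue buys you, here is a stripped-down, single-process analogue (a toy sketch of the persistence contract, not how Apify actually stores state):

```python
import json
from pathlib import Path

class CheckpointQueue:
    """Toy analogue of a persistent request queue: pending and handled
    URLs survive a restart because every transition hits disk."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            state = json.loads(self.path.read_text())
        else:
            state = {'pending': [], 'handled': []}
        self.pending = state['pending']
        self.handled = set(state['handled'])

    def _flush(self):
        self.path.write_text(json.dumps(
            {'pending': self.pending, 'handled': sorted(self.handled)}))

    def add(self, url):
        # Dedupe against both pending and already-handled URLs
        if url not in self.pending and url not in self.handled:
            self.pending.append(url)
            self._flush()

    def fetch(self):
        return self.pending[0] if self.pending else None

    def mark_handled(self, url):
        self.pending.remove(url)
        self.handled.add(url)
        self._flush()
```

A crash between `fetch` and `mark_handled` means the URL is simply fetched again after restart - at-least-once processing, which is why scrapers should write idempotent output.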
Problem
When scraping at scale, individual proxies would get blocked. I needed automatic rotation with smart fallback, but Apify's proxy configuration was confusing.
What I Tried
- Attempt 1: Used Apify Proxy without configuration - blocked quickly
- Attempt 2: Bought a cheap proxy list and rotated manually - most proxies were already burned
- Attempt 3: Used a single datacenter proxy - blocked after 1K requests
Actual Fix
Configure Apify Proxy with proper session management and smart rotation:
```python
from apify import Actor
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.sessions import SessionPool

async def scrape_with_rotation(context: PlaywrightCrawlingContext) -> None:
    """Each session is pinned to one proxy; sessions that accumulate
    errors are retired and replaced automatically."""
    title = await context.page.locator('h1').text_content()
    await context.push_data({'url': context.request.url, 'title': title})

async def main() -> None:
    async with Actor:
        # Option A: rotate your own proxy list
        proxy_configuration = ProxyConfiguration(
            proxy_urls=[
                'http://proxy1.example.com:8000',
                'http://proxy2.example.com:8000',
                'http://proxy3.example.com:8000',
            ],
        )
        # Option B: Apify Proxy (residential, US exit nodes)
        # proxy_configuration = await Actor.create_proxy_configuration(
        #     groups=['RESIDENTIAL'],
        #     country_code='US',
        # )

        crawler = PlaywrightCrawler(
            proxy_configuration=proxy_configuration,
            # Session management is the key to rotation: requests reuse a
            # session (and its proxy) until the session is retired
            use_session_pool=True,
            session_pool=SessionPool(max_pool_size=100),
        )
        crawler.router.default_handler(scrape_with_rotation)
        await crawler.run(['https://example.com'])

# Alternative: tiered proxies - the crawler starts on the cheap tier and
# escalates to the expensive tier only for sites that keep blocking it
tiered_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        ['http://cheap-datacenter-proxy.example.com:8000'],
        ['http://expensive-residential-proxy.example.com:8000'],
    ],
)
```
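Session-based rotation boils down to two counters per session: how many times it has been used, and how often it failed. A minimal model of the retirement policy (my own names and thresholds, not crawlee's internal bookkeeping):

```python
class Session:
    """Toy model of session retirement: retire after a fixed number of
    uses, or when the error ratio crosses a threshold."""

    def __init__(self, max_usage_count=10, error_threshold=0.5):
        self.max_usage_count = max_usage_count
        self.error_threshold = error_threshold
        self.usage_count = 0
        self.error_count = 0

    def record(self, ok: bool) -> None:
        # Called after every request made through this session
        self.usage_count += 1
        if not ok:
            self.error_count += 1

    @property
    def usable(self) -> bool:
        if self.usage_count >= self.max_usage_count:
            return False  # worn out: rotate to a fresh proxy/fingerprint
        if self.usage_count and (
            self.error_count / self.usage_count >= self.error_threshold
        ):
            return False  # probably blocked: abandon this proxy
        return True
```

The usage cap matters as much as the error score: rotating a *healthy* session before the target notices a pattern is what keeps the pool from burning down.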
What I Learned
- Lesson 1: Apify's RequestQueue is the killer feature - automatic persistence and distributed support.
- Lesson 2: Migration from Selenium is easier than expected - Playwright is similar, and Apify wraps the complexity.
- Lesson 3: Built-in proxy rotation is worth it - custom implementations always have edge cases.
- Lesson 4: Storage is automatic - no more database schemas for scraped data.
- Overall: For production scraping, Apify SDK's batteries-included approach saves weeks of engineering time.
Apify vs Selenium: Feature Comparison
| Feature | Selenium | Apify SDK |
|---|---|---|
| Request Queue | Manual (Redis, DB) | Built-in, persists |
| Distributed Crawling | Manual (Redis, Celery) | Built-in |
| Proxy Rotation | Manual | Built-in with smart rotation |
| Storage | Manual (files, DB) | Built-in Dataset, KeyValueStore |
| Retry Logic | Manual | Built-in with backoff |
| Checkpointing | Manual | Automatic |
| Autoscaling | Manual (K8s, etc.) | Built-in on Apify Platform |
| Deployment | Manual (Docker, etc.) | One command to Apify Platform |
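The "retry with backoff" row deserves one concrete shape: virtually every implementation is some variant of exponential backoff with jitter. A dependency-free sketch (parameter names are mine, not the SDK's):

```python
import random

def backoff_delays(max_retries=3, base=1.0, cap=60.0, seed=None):
    """Exponential backoff with full jitter: the delay ceiling doubles
    per attempt (clamped at `cap`), and the actual delay is drawn
    uniformly below the ceiling to avoid synchronized retry storms."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays
```

The jitter is the part hand-rolled retry loops usually forget, and it is exactly what makes a hundred concurrent workers not hammer a recovering site in lockstep.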
Production Setup That Works
```python
# apify_scraper.py - Production configuration
import logging
from datetime import datetime, timezone

from apify import Actor
from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ProductScraper:
    """Production scraper built on the Apify SDK + Crawlee.

    Queue management, proxy rotation, retries with backoff, and storage
    are all handled by the crawler.
    """

    async def run(self) -> None:
        """Main entry point."""
        async with Actor:
            # Residential proxies via Apify Proxy
            proxy_configuration = await Actor.create_proxy_configuration(
                groups=['RESIDENTIAL'],
                country_code='US',
            )

            crawler = PlaywrightCrawler(
                # Browser configuration
                headless=True,
                browser_type='chromium',
                max_requests_per_crawl=1000,
                # Concurrency
                concurrency_settings=ConcurrencySettings(max_concurrency=10),
                # Proxy + session management: sessions pin proxies and are
                # retired once they start failing
                proxy_configuration=proxy_configuration,
                use_session_pool=True,
                # Retry configuration
                max_request_retries=3,
            )
            crawler.router.default_handler(self.handle_page)

            # Seed the queue and run
            await crawler.run([
                'https://example.com/products?page=1',
                'https://example.com/products?page=2',
                'https://example.com/products?page=3',
            ])

    async def handle_page(self, context: PlaywrightCrawlingContext) -> None:
        """Handle one page.

        Raising here makes the crawler retry the request automatically
        (up to max_request_retries) with a fresh session, so there is no
        need to swallow exceptions.
        """
        page = context.page
        request = context.request
        logger.info('Processing: %s', request.url)

        # Wait for content to load
        await page.wait_for_selector('.product-list', timeout=10_000)

        # Extract product data in the browser context
        products = await page.eval_on_selector_all(
            '.product-item',
            '''elements => elements.map(el => ({
                id: el.getAttribute('data-id'),
                name: el.querySelector('.name')?.textContent,
                price: el.querySelector('.price')?.textContent,
                in_stock: el.querySelector('.stock')?.textContent === 'In Stock'
            }))''',
        )

        # Add metadata
        for product in products:
            product['url'] = request.url
            product['scraped_at'] = datetime.now(timezone.utc).isoformat()

        # Save to the default dataset (automatic storage)
        if products:
            await context.push_data(products)
            logger.info('Saved %d products', len(products))

        # Find and enqueue next pages (the request queue dedupes repeats)
        next_pages = await page.eval_on_selector_all(
            'a.pagination-link[href*="/products?page="]',
            'elements => elements.map(el => el.href)',
        )
        await context.add_requests([Request.from_url(url) for url in next_pages])


# Deployment helper (requires the Apify CLI and an APIFY_TOKEN)
def deploy_to_apify() -> None:
    """Deploy the actor to the Apify Platform."""
    import subprocess

    subprocess.run(['apify', 'login'], check=True)  # first time only
    subprocess.run(['apify', 'push'], check=True)
    print('Deployed! View at: https://console.apify.com/actors')


# Main entry point
if __name__ == '__main__':
    import asyncio

    # Run locally for testing
    asyncio.run(ProductScraper().run())
    # Or deploy to the Apify Platform:
    # deploy_to_apify()
```
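Pagination is where crawls loop forever or silently miss pages, so it pays to normalize the extracted hrefs before enqueueing. A small dependency-free helper (my own, not part of the SDK - the request queue dedupes anyway, but normalizing first avoids needless queue traffic and fragment-only "new" URLs):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize_next_pages(base_url, hrefs, seen):
    """Resolve pagination hrefs against the current page, strip URL
    fragments, and drop anything already seen in this crawl."""
    out = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        scheme, netloc, path, query, _fragment = urlsplit(absolute)
        cleaned = urlunsplit((scheme, netloc, path, query, ''))
        if cleaned not in seen:
            seen.add(cleaned)
            out.append(cleaned)
    return out
```

Fragments (`#top`, `#reviews`) are the classic trap: they produce "distinct" URLs that all render the same page, and an unnormalized crawler happily enqueues every one of them.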
Working with Apify Storage
Dataset Operations
```python
from apify import Actor

async def dataset_operations() -> None:
    """Work with Apify datasets."""
    async with Actor:
        # Push a single item to the default dataset
        await Actor.push_data({
            'url': 'https://example.com',
            'title': 'Example',
        })

        # Push multiple items at once
        await Actor.push_data([
            {'id': 1, 'name': 'Item 1'},
            {'id': 2, 'name': 'Item 2'},
            {'id': 3, 'name': 'Item 3'},
        ])

        # Work with the dataset object directly
        dataset = await Actor.open_dataset()

        # Inspect dataset metadata
        info = await dataset.get_info()
        print(f'Dataset has {info.item_count} items')

        # Stream items without loading everything into memory
        async for item in dataset.iterate_items():
            print(item)

# CSV/JSON/XML exports need no custom code: use the dataset's API
# endpoints, the console's Export button, or the Apify CLI.
```
KeyValueStore for Files
```python
from apify import Actor

async def key_value_store_operations() -> None:
    """Store files, screenshots, etc."""
    async with Actor:
        store = await Actor.open_key_value_store()

        # Store a screenshot (Playwright's page.screenshot() returns PNG bytes)
        # screenshot = await page.screenshot()
        # await store.set_value('homepage.png', screenshot, content_type='image/png')

        # Store JSON config
        await store.set_value('config', {
            'last_run': '2026-03-23',
            'pages_processed': 1000,
        })

        # Store HTML
        await store.set_value('page.html', '...', content_type='text/html')

        # Retrieve values
        config = await store.get_value('config')
        print(f'Config: {config}')
```
Monitoring & Debugging
Local Testing
```bash
# Install the Apify CLI
npm install -g apify-cli

# Scaffold a new actor from a template
apify create my-scraper

# Run locally
apify run

# Test with specific input: `apify run` reads INPUT.json from local
# storage, not from stdin
echo '{"urls": ["https://example.com"]}' \
  > storage/key_value_stores/default/INPUT.json
apify run
```
Common Issues
- Queue not persisting: Ensure you're using Actor.open_request_queue(), not an in-memory queue
- Proxy blocked: Switch proxy groups (RESIDENTIAL, DATACENTER)
- Memory issues: Reduce max_concurrency or increase memory in Actor config
- Slow crawling: Increase concurrency or optimize page waits
Migration Checklist: Selenium to Apify
- ✓ Install the SDK and a browser: pip install apify 'crawlee[playwright]' && playwright install chromium
- ✓ Convert scraping logic to async/await
- ✓ Replace Selenium selectors with Playwright equivalents
- ✓ Wrap everything in the Actor context
- ✓ Use Actor.push_data() instead of manual storage
- ✓ Configure proxy rotation
- ✓ Test locally with apify run
- ✓ Deploy to the Apify Platform with apify push