Pyppeteer: Finally Got Puppeteer Working in Python
I wanted Puppeteer's power in Python. Pyppeteer seemed perfect - async/await, headless Chrome, the works. But the executable path errors and download issues drove me crazy. Here's what finally worked.
Why Pyppeteer Over Playwright?
Pyppeteer is a Python port of Puppeteer (Node.js). While Playwright is more modern, Pyppeteer still has advantages:
- Same API as Puppeteer: If you know Puppeteer, you know Pyppeteer
- Auto-downloads Chromium: No manual Chrome installation needed
- Native async/await: Built for Python 3.5+ asyncio
- Smaller footprint: More lightweight than Playwright
- Stable: Battle-tested in production since 2018
That said, Pyppeteer hasn't been updated since 2020. For new projects, consider Playwright-python or pyppeteer-stealth.
Problem 1: Chromium Executable Not Found
Pyppeteer automatically downloads Chromium, but I kept getting "Executable doesn't exist" errors. The path it expected didn't match where Chromium actually was.
Error: FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.local/share/pyppeteer/local-chromium/588429/chrome-linux/chrome'
What I Tried
Attempt 1: Reinstalled pyppeteer - Downloaded Chromium but still couldn't find it
Attempt 2: Set CHROME_PATH environment variable - Pyppeteer ignored it
Attempt 3: Manually downloaded Chromium - Wrong version, incompatible with Pyppeteer
Actual Fix
The issue is that Pyppeteer downloads Chromium lazily on first launch, and an interrupted download can leave a partial install that Pyppeteer then can't find. Force the download up front and pass the executable path explicitly:
import asyncio
from pyppeteer import launch
from pyppeteer import chromium_downloader


def force_chromium_download():
    """
    Force Pyppeteer to download Chromium.
    Run this once before using pyppeteer.
    Note: download_chromium() is a regular blocking function,
    not a coroutine -- don't await it.
    """
    if not chromium_downloader.check_chromium():
        print("Downloading Chromium...")
        chromium_downloader.download_chromium()
    chromium_path = chromium_downloader.chromium_executable()
    print(f"Chromium downloaded to: {chromium_path}")
    return chromium_path


async def scrape_with_pyppeteer():
    """
    Scrape with an explicit executable path
    """
    # Get the correct Chromium path
    chromium_path = str(chromium_downloader.chromium_executable())
    print(f"Using Chromium at: {chromium_path}")

    # Launch with explicit path
    browser = await launch(
        executablePath=chromium_path,
        headless=True,
        args=['--no-sandbox', '--disable-setuid-sandbox']
    )
    page = await browser.newPage()
    await page.goto('https://example.com')
    title = await page.title()
    print(f"Page title: {title}")
    await browser.close()


# Run
if __name__ == '__main__':
    # First time: download Chromium
    # force_chromium_download()

    # Then use it
    asyncio.run(scrape_with_pyppeteer())
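To sanity-check the error in the first place, it helps to know exactly where Pyppeteer expects the binary. The helper below is hypothetical (not part of pyppeteer's API); it reconstructs the expected path from pyppeteer's `local-chromium` directory layout so you can check whether the file really exists:

```python
from pathlib import Path
import platform


def expected_chromium_path(base_dir: str, revision: str, system: str = None) -> Path:
    """Reconstruct the path where Pyppeteer expects the Chromium binary
    for a given revision, mirroring its local-chromium directory layout."""
    system = system or platform.system()
    if system == 'Linux':
        rel = Path('chrome-linux') / 'chrome'
    elif system == 'Darwin':  # macOS
        rel = Path('chrome-mac') / 'Chromium.app' / 'Contents' / 'MacOS' / 'Chromium'
    elif system == 'Windows':
        rel = Path('chrome-win32') / 'chrome.exe'
    else:
        raise ValueError(f'Unsupported platform: {system}')
    return Path(base_dir) / 'local-chromium' / revision / rel


# Reproduce the path from the error message above
print(expected_chromium_path('/home/user/.local/share/pyppeteer', '588429', system='Linux'))
```

If the printed path exists but launch still fails, the binary is probably a truncated download; delete the revision directory and re-download.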
Problem 2: Chromium Download Timing Out
On first run, Pyppeteer tried to download Chromium (~300MB), but the download kept timing out: it would hang at random percentages and fail.
Error: DownloadTimeoutError: Chromium download timed out after 30 seconds
What I Tried
Attempt 1: Increased timeout - Still failed on slow connections
Attempt 2: Used VPN - Helped but download was extremely slow
Attempt 3: Downloaded manually from Google - Wrong version, Pyppeteer rejected it
Actual Fix
Download Chromium manually (or from a mirror) and set the path explicitly; a system-installed Chrome also works. Users in China should set a download mirror:
# Fix 1: Use a download mirror (helpful for users in China)
import os
# Must be set BEFORE pyppeteer is first imported --
# the host is read at import time
os.environ['PYPPETEER_DOWNLOAD_HOST'] = 'https://cdn.npmmirror.com/binaries'

# Fix 2: Manual download and configuration
"""
Manual Chromium download:
1. Find your Pyppeteer version's Chromium revision:
   python -c "import pyppeteer; print(pyppeteer.__chromium_revision__)"
2. Download Chromium from:
   https://commondatastorage.googleapis.com/chromium-browser-snapshots/Linux_x64/REVISION/chrome-linux.zip
3. Extract to:
   ~/.local/share/pyppeteer/local-chromium/REVISION/
4. Verify the path with pyppeteer.chromium_downloader.chromium_executable()
"""
async def setup_chromium_manually():
    """
    Configure manually downloaded Chromium
    """
    import os

    # Option 1: Use environment variable
    # (set BEFORE pyppeteer is first imported --
    # the revision is read at import time)
    os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'

    from pyppeteer import launch

    # Option 2: Specify custom executable path
    custom_chrome_path = '/usr/bin/google-chrome'  # System Chrome
    browser = await launch(
        executablePath=custom_chrome_path,
        headless=True,
        args=['--no-sandbox']
    )
    return browser
# Fix 3: Use system Chrome (simpler, more compatible)
def use_system_chrome():
    """
    Find a system-installed Chrome to use instead of downloaded Chromium
    """
    import os

    # Check common Chrome install locations
    chrome_paths = [
        '/usr/bin/google-chrome',
        '/usr/bin/google-chrome-stable',
        '/usr/bin/chromium-browser',
        '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',  # macOS
        'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe'  # Windows
    ]
    for path in chrome_paths:
        if os.path.exists(path):
            print(f"Found Chrome at: {path}")
            return path

    print("No system Chrome found, will download Chromium")
    return None
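Step 2 of the manual download can be scripted. The `snapshot_url` helper below is mine, not pyppeteer's; it assumes the standard chromium-browser-snapshots bucket layout:

```python
def snapshot_url(revision: str, platform_dir: str = 'Linux_x64',
                 archive: str = 'chrome-linux.zip',
                 host: str = 'https://commondatastorage.googleapis.com') -> str:
    """Build a Chromium snapshot download URL for a given revision.
    platform_dir is one of Linux_x64, Mac, Win, Win_x64; the archive
    name must match the platform (chrome-linux.zip, chrome-mac.zip,
    chrome-win32.zip)."""
    return f'{host}/chromium-browser-snapshots/{platform_dir}/{revision}/{archive}'


print(snapshot_url('588429'))
```

Feed the result to `curl -L -o chromium.zip <url>` and extract it into the revision directory from step 3.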
Problem 3: Concurrent Pages Hanging
When building an async crawler to scrape multiple pages concurrently, pages would hang or never resolve. The async context wasn't being managed properly.
What I Tried
Attempt 1: Created multiple browser instances - Ran out of memory
Attempt 2: Used asyncio.gather() - Pages still hung
Attempt 3: Added timeouts - Helped but didn't fix root cause
Actual Fix
The key is reusing browser context and properly managing page lifecycles:
import asyncio
from pyppeteer import launch
from typing import List
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class AsyncPyppeteerCrawler:
    """
    Production async crawler with Pyppeteer
    """

    def __init__(self, concurrency: int = 5):
        """
        Args:
            concurrency: Number of concurrent pages to scrape
        """
        self.concurrency = concurrency
        self.browser = None

    async def init(self):
        """Initialize browser (call once)"""
        self.browser = await launch(
            headless=True,
            args=['--no-sandbox', '--disable-dev-shm-usage']
        )
        logger.info("Browser launched")

    async def close(self):
        """Close browser (call when done)"""
        if self.browser:
            await self.browser.close()
            logger.info("Browser closed")

    async def scrape_page(self, url: str) -> dict:
        """
        Scrape a single page

        Args:
            url: URL to scrape

        Returns:
            Scraped data
        """
        if not self.browser:
            raise RuntimeError("Call init() first")

        page = None
        try:
            # Create new page
            page = await self.browser.newPage()

            # Set viewport and user agent
            await page.setViewport({'width': 1920, 'height': 1080})
            await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36')

            # Navigate with timeout
            await page.goto(url, {'timeout': 30000, 'waitUntil': 'networkidle0'})

            # Wait for content
            await page.waitForSelector('body', {'timeout': 10000})

            # Extract data
            data = await page.evaluate('''() => ({
                title: document.title,
                url: window.location.href,
                bodyLength: document.body.innerText.length
            })''')

            logger.info(f"Scraped: {url}")
            return data

        except Exception as e:
            logger.error(f"Error scraping {url}: {e}")
            return {'error': str(e), 'url': url}

        finally:
            # Always close page to free memory
            if page:
                await page.close()

    async def scrape_urls(self, urls: List[str]) -> List[dict]:
        """
        Scrape multiple URLs concurrently

        Args:
            urls: List of URLs to scrape

        Returns:
            List of scraped data
        """
        # Create semaphore to limit concurrency
        semaphore = asyncio.Semaphore(self.concurrency)

        async def scrape_with_semaphore(url):
            async with semaphore:
                return await self.scrape_page(url)

        # Scrape all URLs concurrently
        results = await asyncio.gather(
            *[scrape_with_semaphore(url) for url in urls],
            return_exceptions=True
        )

        # Filter exceptions
        return [r for r in results if not isinstance(r, Exception)]


# Usage
async def main():
    crawler = AsyncPyppeteerCrawler(concurrency=3)
    try:
        await crawler.init()
        urls = [
            'https://example.com/page1',
            'https://example.com/page2',
            'https://example.com/page3',
            'https://example.com/page4',
            'https://example.com/page5'
        ]
        results = await crawler.scrape_urls(urls)
        for result in results:
            if 'error' not in result:
                print(f"{result['url']}: {result['title']}")
    finally:
        await crawler.close()


# Run
if __name__ == '__main__':
    asyncio.run(main())
What I Learned
- Lesson 1: Always explicitly set executable path - Pyppeteer's auto-detection is flaky.
- Lesson 2: For production, use system Chrome instead of downloaded Chromium - more stable.
- Lesson 3: Always close pages after use - memory leaks will crash your crawler.
- Lesson 4: Use semaphore for concurrency control - don't create unlimited concurrent pages.
- Overall: Pyppeteer works but requires more setup than expected. Consider pyppeteer-stealth for anti-bot detection.
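Lesson 4 in isolation: a semaphore caps how many coroutines run at once, independent of Pyppeteer. A minimal, self-contained demonstration (a short sleep stands in for a real page scrape):

```python
import asyncio


async def demo(limit: int = 3, tasks: int = 10) -> int:
    """Run `tasks` coroutines under a semaphore and report the peak
    number that were ever in flight simultaneously."""
    sem = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def worker(i):
        nonlocal running, peak
        async with sem:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for a real scrape
            running -= 1

    await asyncio.gather(*(worker(i) for i in range(tasks)))
    return peak


print(asyncio.run(demo()))  # never exceeds the limit of 3
```

Without the `async with sem:` block, all ten workers would run at once; with Pyppeteer that means ten open pages and, on a small box, an out-of-memory crash.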
Better Alternatives in 2026
1. Playwright-python (Recommended)
Playwright is the modern successor to Puppeteer with better Python support:
# Playwright is more reliable
from playwright.async_api import async_playwright


async def scrape_with_playwright():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')
        title = await page.title()
        await browser.close()
        return title
2. pyppeteer-stealth
If you must use Pyppeteer, add stealth mode to avoid bot detection:
from pyppeteer_stealth import stealth
from pyppeteer import launch


async def scrape_with_stealth():
    browser = await launch(headless=True)
    page = await browser.newPage()

    # Apply stealth patches before navigating
    await stealth(page)

    await page.goto('https://bot-detection.example.com')
    # Now less likely to be detected
    await browser.close()
Production Setup That Works
# pyppeteer_scraper.py - Production configuration
import asyncio
import os
from pyppeteer import launch
from typing import List, Dict
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class PyppeteerScraper:
    """
    Production Pyppeteer scraper with:
    - System Chrome support
    - Concurrent crawling
    - Error handling
    - Memory management
    """

    def __init__(
        self,
        headless: bool = True,
        concurrency: int = 3,
        chrome_path: str = None
    ):
        """
        Args:
            headless: Run headless browser
            concurrency: Number of concurrent pages
            chrome_path: Path to Chrome executable (None = auto-detect)
        """
        self.headless = headless
        self.concurrency = concurrency
        self.chrome_path = chrome_path or self._find_chrome()
        self.browser = None

    def _find_chrome(self) -> str:
        """Find system Chrome executable"""
        paths = {
            'Linux': [
                '/usr/bin/google-chrome',
                '/usr/bin/google-chrome-stable',
                '/usr/bin/chromium-browser'
            ],
            'Darwin': [  # macOS
                '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
            ],
            'Windows': [
                'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
                'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe'
            ]
        }
        import platform
        system = platform.system()
        for path in paths.get(system, []):
            if os.path.exists(path):
                logger.info(f"Found Chrome at: {path}")
                return path

        logger.warning("No system Chrome found, will use Pyppeteer's Chromium")
        return None

    async def init(self):
        """Initialize browser"""
        launch_args = {
            'headless': self.headless,
            'args': [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--disable-blink-features=AutomationControlled'
            ]
        }
        if self.chrome_path:
            launch_args['executablePath'] = self.chrome_path

        self.browser = await launch(**launch_args)
        logger.info("Browser initialized")

    async def close(self):
        """Close browser"""
        if self.browser:
            await self.browser.close()
            self.browser = None

    async def scrape_url(self, url: str, wait_for: str = None) -> Dict:
        """
        Scrape a single URL

        Args:
            url: URL to scrape
            wait_for: CSS selector to wait for (optional)

        Returns:
            Scraped data
        """
        if not self.browser:
            await self.init()

        page = None
        try:
            page = await self.browser.newPage()

            # Set viewport
            await page.setViewport({'width': 1920, 'height': 1080})

            # Navigate
            await page.goto(url, {'timeout': 30000, 'waitUntil': 'networkidle0'})

            # Wait for selector if specified
            if wait_for:
                await page.waitForSelector(wait_for, {'timeout': 10000})

            # Extract data
            data = await page.evaluate('''() => ({
                title: document.title,
                url: window.location.href,
                html: document.documentElement.outerHTML.substring(0, 10000)
            })''')

            logger.info(f"Scraped: {url}")
            return data

        except Exception as e:
            logger.error(f"Error scraping {url}: {e}")
            return {'error': str(e), 'url': url}

        finally:
            if page:
                await page.close()

    async def scrape_batch(self, urls: List[str]) -> List[Dict]:
        """
        Scrape multiple URLs concurrently

        Args:
            urls: URLs to scrape

        Returns:
            List of scraped data
        """
        semaphore = asyncio.Semaphore(self.concurrency)

        async def scrape_with_limit(url):
            async with semaphore:
                return await self.scrape_url(url)

        results = await asyncio.gather(
            *[scrape_with_limit(url) for url in urls],
            return_exceptions=True
        )
        return [r for r in results if not isinstance(r, Exception)]

    async def __aenter__(self):
        await self.init()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.close()


# Usage
async def main():
    async with PyppeteerScraper(concurrency=3) as scraper:
        urls = [
            'https://example.com',
            'https://example.org',
            'https://example.net'
        ]
        results = await scraper.scrape_batch(urls)
        for result in results:
            if 'error' not in result:
                print(f"{result['url']}: {result['title']}")


if __name__ == '__main__':
    asyncio.run(main())
Monitoring & Debugging
Common Issues
- Executable not found: Set chrome_path explicitly
- Memory leaks: Always close pages after use
- Timeouts: Increase timeout or check page loading
- Bot detection: Use pyppeteer-stealth
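For the timeout bullet, wrapping each scrape in `asyncio.wait_for` plus a retry loop is often enough. A generic sketch, independent of Pyppeteer (the `with_retries` helper is mine; pass it any zero-arg callable that returns a fresh coroutine):

```python
import asyncio


async def with_retries(make_coro, attempts: int = 3, timeout: float = 10.0):
    """Await make_coro() with a hard timeout, retrying on any failure.
    make_coro must be a zero-arg callable returning a fresh coroutine,
    since a coroutine object can only be awaited once."""
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(make_coro(), timeout=timeout)
        except Exception as exc:
            last_exc = exc
    raise last_exc


async def main():
    calls = {'n': 0}

    async def flaky():
        # Fails twice, then succeeds -- simulates a flaky page load
        calls['n'] += 1
        if calls['n'] < 3:
            raise RuntimeError('transient failure')
        return 'ok'

    result = await with_retries(flaky, attempts=5, timeout=1.0)
    print(result, calls['n'])


asyncio.run(main())
```

In the crawler above, the equivalent call would be `with_retries(lambda: self.scrape_url(url))`, keeping the retry policy out of the scraping logic itself.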
Debug Helper
async def debug_pyppeteer():
    """Debug Pyppeteer setup"""
    # Check bundled Chromium path
    from pyppeteer.chromium_downloader import chromium_executable
    print(f"Chromium path: {chromium_executable()}")

    # Check system Chrome
    scraper = PyppeteerScraper()
    print(f"System Chrome: {scraper.chrome_path}")

    # Test launch (init() stores the browser on the scraper)
    await scraper.init()
    print(f"Browser launched: {scraper.browser is not None}")
    await scraper.close()
⚠️ Maintenance Note
Pyppeteer hasn't been updated since 2020. For new projects, consider Playwright-python or Selenium. This article is for those maintaining existing Pyppeteer codebases.