Pyppeteer: Finally Got Puppeteer Working in Python

I wanted Puppeteer's power in Python. Pyppeteer seemed perfect - async/await, headless Chrome, the works. But the executable path errors and download issues drove me crazy. Here's what finally worked.

Why Pyppeteer Over Playwright?

Pyppeteer is a Python port of Puppeteer (Node.js). While Playwright is more modern, Pyppeteer's API maps almost one-to-one onto Puppeteer's, which helps when you're porting existing Puppeteer scripts or maintaining an older codebase.

That said, Pyppeteer hasn't been updated since 2020. For new projects, consider Playwright-python; if you're stuck on Pyppeteer, pyppeteer-stealth at least helps with bot detection.

Problem

Pyppeteer automatically downloads Chromium, but I kept getting "Executable doesn't exist" errors. The path it expected didn't match where Chromium actually was.

Error: FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.local/share/pyppeteer/local-chromium/588429/chrome-linux/chrome'

What I Tried

Attempt 1: Reinstalled pyppeteer - Downloaded Chromium but still couldn't find it
Attempt 2: Set CHROME_PATH environment variable - Pyppeteer ignored it
Attempt 3: Manually downloaded Chromium - Wrong version, incompatible with Pyppeteer
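Before reaching for any of these fixes, it helps to compare the path in the error with what is actually on disk. A stdlib-only diagnostic sketch (the default Linux data dir and the revision `588429` from the error above are assumptions; adjust for your setup):

```python
from pathlib import Path

def diagnose_chromium(revision: str = '588429') -> None:
    """Print the executable path Pyppeteer expects vs. what exists on disk."""
    # Default Pyppeteer data dir on Linux; macOS/Windows use the
    # platform-specific appdirs location instead
    base = Path.home() / '.local' / 'share' / 'pyppeteer' / 'local-chromium'
    expected = base / revision / 'chrome-linux' / 'chrome'

    print(f"Expected executable: {expected}")
    print(f"Exists: {expected.exists()}")

    if base.exists():
        # Show which revisions were actually downloaded
        for entry in sorted(base.iterdir()):
            print(f"Found revision dir: {entry.name}")
    else:
        print("No local-chromium directory - the download never completed")
```

In my case this showed a revision directory that existed but was missing the `chrome` binary itself, which pointed to an interrupted download.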

Actual Fix

The issue is that Pyppeteer downloads Chromium lazily on first launch, and an interrupted download can leave a partial install at the path Pyppeteer expects. Force the download up front and pass the executable path explicitly:

import asyncio
from pyppeteer import launch
from pyppeteer.chromium_downloader import (
    check_chromium,
    chromium_executable,
    download_chromium,
)

def ensure_chromium():
    """
    Download Chromium if it isn't already present.
    Note: download_chromium() is synchronous, so run it
    before entering the event loop.
    """
    if not check_chromium():
        print("Downloading Chromium...")
        download_chromium()

    chromium_path = chromium_executable()
    print(f"Chromium downloaded to: {chromium_path}")
    return chromium_path

async def scrape_with_pyppeteer():
    """
    Scrape with an explicit executable path
    """
    # Get the correct Chromium path (downloads it if needed)
    chromium_path = ensure_chromium()

    print(f"Using Chromium at: {chromium_path}")

    # Launch with explicit path
    browser = await launch(
        executablePath=str(chromium_path),
        headless=True,
        args=['--no-sandbox', '--disable-setuid-sandbox']
    )

    page = await browser.newPage()

    await page.goto('https://example.com')

    title = await page.title()
    print(f"Page title: {title}")

    await browser.close()

# Run
if __name__ == '__main__':
    asyncio.run(scrape_with_pyppeteer())

Problem

On first run, Pyppeteer tried to download Chromium (~300MB) but kept timing out. The download would hang at random percentages and fail.

Error: DownloadTimeoutError: Chromium download timed out after 30 seconds

What I Tried

Attempt 1: Increased timeout - Still failed on slow connections
Attempt 2: Used VPN - Helped but download was extremely slow
Attempt 3: Downloaded manually from Google - Wrong version, Pyppeteer rejected it

Actual Fix

Either point Pyppeteer at a faster download mirror, download Chromium manually, or fall back to system Chrome:

# Fix 1: Use a download mirror (e.g. npmmirror's CDN for users in China)
# NOTE: set this BEFORE pyppeteer is first imported - the download
# host is read once at import time
import os
os.environ['PYPPETEER_DOWNLOAD_HOST'] = 'https://cdn.npmmirror.com/binaries'

# Fix 2: Manual download and configuration
"""
Manual Chromium download:

1. Find your Pyppeteer version's Chromium revision:
   python -c "import pyppeteer; print(pyppeteer.__chromium_revision__)"

2. Download Chromium from the official snapshot bucket:
   https://storage.googleapis.com/chromium-browser-snapshots/Linux_x64/REVISION/chrome-linux.zip

3. Extract to:
   ~/.local/share/pyppeteer/local-chromium/REVISION/

4. Verify the path with pyppeteer.chromium_downloader.chromium_executable()
"""

async def setup_chromium_manually():
    """
    Configure manually downloaded Chromium
    """
    import os
    from pyppeteer import launch

    # Option 1: Pin the revision via environment variable.
    # Caveat: pyppeteer reads this once at import time, so it must be
    # set before the first `import pyppeteer` to have any effect.
    os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'

    # Option 2: Specify a custom executable path
    custom_chrome_path = '/usr/bin/google-chrome'  # System Chrome

    browser = await launch(
        executablePath=custom_chrome_path,
        headless=True,
        args=['--no-sandbox']
    )

    return browser

# Fix 3: Use system Chrome (simpler, more compatible)
async def use_system_chrome():
    """
    Use system-installed Chrome instead of downloaded Chromium
    """
    # Check available Chrome installations
    chrome_paths = [
        '/usr/bin/google-chrome',
        '/usr/bin/google-chrome-stable',
        '/usr/bin/chromium-browser',
        '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',  # macOS
        'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe'  # Windows
    ]

    import os
    for path in chrome_paths:
        if os.path.exists(path):
            print(f"Found Chrome at: {path}")
            return path

    print("No system Chrome found, will download Chromium")
    return None

Problem

When building an async crawler to scrape multiple pages concurrently, pages would hang or never resolve. The async context wasn't being managed properly.

What I Tried

Attempt 1: Created multiple browser instances - Ran out of memory
Attempt 2: Used asyncio.gather() - Pages still hung
Attempt 3: Added timeouts - Helped but didn't fix root cause

Actual Fix

The key is reusing browser context and properly managing page lifecycles:

import asyncio
from pyppeteer import launch
from typing import List
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class AsyncPyppeteerCrawler:
    """
    Production async crawler with Pyppeteer
    """

    def __init__(self, concurrency: int = 5):
        """
        Args:
            concurrency: Number of concurrent pages to scrape
        """
        self.concurrency = concurrency
        self.browser = None

    async def init(self):
        """Initialize browser (call once)"""
        self.browser = await launch(
            headless=True,
            args=['--no-sandbox', '--disable-dev-shm-usage']
        )
        logger.info("Browser launched")

    async def close(self):
        """Close browser (call when done)"""
        if self.browser:
            await self.browser.close()
            logger.info("Browser closed")

    async def scrape_page(self, url: str) -> dict:
        """
        Scrape a single page

        Args:
            url: URL to scrape

        Returns:
            Scraped data
        """
        if not self.browser:
            raise RuntimeError("Call init() first")

        page = None
        try:
            # Create new page
            page = await self.browser.newPage()

            # Set viewport and user agent
            await page.setViewport({'width': 1920, 'height': 1080})
            await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36')

            # Navigate with timeout
            await page.goto(url, {'timeout': 30000, 'waitUntil': 'networkidle0'})

            # Wait for content
            await page.waitForSelector('body', {'timeout': 10000})

            # Extract data
            data = await page.evaluate('''() => ({
                title: document.title,
                url: window.location.href,
                bodyLength: document.body.innerText.length
            })''')

            logger.info(f"Scraped: {url}")
            return data

        except Exception as e:
            logger.error(f"Error scraping {url}: {e}")
            return {'error': str(e), 'url': url}

        finally:
            # Always close page to free memory
            if page:
                await page.close()

    async def scrape_urls(self, urls: List[str]) -> List[dict]:
        """
        Scrape multiple URLs concurrently

        Args:
            urls: List of URLs to scrape

        Returns:
            List of scraped data
        """
        # Create semaphore to limit concurrency
        semaphore = asyncio.Semaphore(self.concurrency)

        async def scrape_with_semaphore(url):
            async with semaphore:
                return await self.scrape_page(url)

        # Scrape all URLs concurrently
        results = await asyncio.gather(
            *[scrape_with_semaphore(url) for url in urls],
            return_exceptions=True
        )

        # Filter exceptions
        return [r for r in results if not isinstance(r, Exception)]

# Usage
async def main():
    crawler = AsyncPyppeteerCrawler(concurrency=3)

    try:
        await crawler.init()

        urls = [
            'https://example.com/page1',
            'https://example.com/page2',
            'https://example.com/page3',
            'https://example.com/page4',
            'https://example.com/page5'
        ]

        results = await crawler.scrape_urls(urls)

        for result in results:
            if 'error' not in result:
                print(f"{result['url']}: {result['title']}")

    finally:
        await crawler.close()

# Run
if __name__ == '__main__':
    asyncio.run(main())
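The semaphore pattern above is independent of Pyppeteer. Isolating it with a stand-in coroutine (a hypothetical `fake_scrape` instead of `scrape_page`) makes it easy to verify the concurrency plumbing without launching a browser:

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar('T')
R = TypeVar('R')

async def gather_limited(
    func: Callable[[T], Awaitable[R]],
    items: List[T],
    limit: int,
) -> List[R]:
    """Run func over items with at most `limit` calls in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(item: T) -> R:
        async with semaphore:
            return await func(item)

    # gather() preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(i) for i in items))

# Stand-in for scrape_page: echoes the URL after a short sleep
async def fake_scrape(url: str) -> str:
    await asyncio.sleep(0.01)
    return f"done:{url}"
```

Running `asyncio.run(gather_limited(fake_scrape, urls, 2))` shows that results come back in input order even though only two coroutines run at a time.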

What I Learned

Better Alternatives in 2026

1. Playwright-python (Recommended)

Playwright is the modern successor to Puppeteer with better Python support:

# Playwright is more reliable
from playwright.async_api import async_playwright

async def scrape_with_playwright():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')
        title = await page.title()
        await browser.close()
        return title

2. pyppeteer-stealth

If you must use Pyppeteer, add stealth mode to avoid bot detection:

from pyppeteer_stealth import stealth
from pyppeteer import launch

async def scrape_with_stealth():
    browser = await launch(headless=True)
    page = await browser.newPage()

    # Apply stealth patches
    await stealth(page)

    await page.goto('https://bot-detection.example.com')
    # Now less likely to be detected

    await browser.close()

Production Setup That Works

# pyppeteer_scraper.py - Production configuration

import asyncio
import os
from pyppeteer import launch
from typing import List, Dict
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class PyppeteerScraper:
    """
    Production Pyppeteer scraper with:
    - System Chrome support
    - Concurrent crawling
    - Error handling
    - Memory management
    """

    def __init__(
        self,
        headless: bool = True,
        concurrency: int = 3,
        chrome_path: str = None
    ):
        """
        Args:
            headless: Run headless browser
            concurrency: Number of concurrent pages
            chrome_path: Path to Chrome executable (None = auto-detect)
        """
        self.headless = headless
        self.concurrency = concurrency
        self.chrome_path = chrome_path or self._find_chrome()
        self.browser = None

    def _find_chrome(self) -> str:
        """Find system Chrome executable"""
        paths = {
            'Linux': [
                '/usr/bin/google-chrome',
                '/usr/bin/google-chrome-stable',
                '/usr/bin/chromium-browser'
            ],
            'Darwin': [  # macOS
                '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
            ],
            'Windows': [
                'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
                'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe'
            ]
        }

        import platform
        system = platform.system()

        for path in paths.get(system, []):
            if os.path.exists(path):
                logger.info(f"Found Chrome at: {path}")
                return path

        logger.warning("No system Chrome found, will use Pyppeteer's Chromium")
        return None

    async def init(self):
        """Initialize browser"""
        launch_args = {
            'headless': self.headless,
            'args': [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--disable-blink-features=AutomationControlled'
            ]
        }

        if self.chrome_path:
            launch_args['executablePath'] = self.chrome_path

        self.browser = await launch(**launch_args)
        logger.info("Browser initialized")

    async def close(self):
        """Close browser"""
        if self.browser:
            await self.browser.close()
            self.browser = None

    async def scrape_url(self, url: str, wait_for: str = None) -> Dict:
        """
        Scrape a single URL

        Args:
            url: URL to scrape
            wait_for: CSS selector to wait for (optional)

        Returns:
            Scraped data
        """
        if not self.browser:
            await self.init()

        page = None
        try:
            page = await self.browser.newPage()

            # Set viewport
            await page.setViewport({'width': 1920, 'height': 1080})

            # Navigate
            await page.goto(url, {'timeout': 30000, 'waitUntil': 'networkidle0'})

            # Wait for selector if specified
            if wait_for:
                await page.waitForSelector(wait_for, {'timeout': 10000})

            # Extract data
            data = await page.evaluate('''() => ({
                title: document.title,
                url: window.location.href,
                html: document.documentElement.outerHTML.substring(0, 10000)
            })''')

            logger.info(f"Scraped: {url}")
            return data

        except Exception as e:
            logger.error(f"Error scraping {url}: {e}")
            return {'error': str(e), 'url': url}

        finally:
            if page:
                await page.close()

    async def scrape_batch(self, urls: List[str]) -> List[Dict]:
        """
        Scrape multiple URLs concurrently

        Args:
            urls: URLs to scrape

        Returns:
            List of scraped data
        """
        semaphore = asyncio.Semaphore(self.concurrency)

        async def scrape_with_limit(url):
            async with semaphore:
                return await self.scrape_url(url)

        results = await asyncio.gather(
            *[scrape_with_limit(url) for url in urls],
            return_exceptions=True
        )

        return [r for r in results if not isinstance(r, Exception)]

    async def __aenter__(self):
        await self.init()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.close()

# Usage
async def main():
    async with PyppeteerScraper(concurrency=3) as scraper:
        urls = [
            'https://example.com',
            'https://example.org',
            'https://example.net'
        ]

        results = await scraper.scrape_batch(urls)

        for result in results:
            if 'error' not in result:
                print(f"{result['url']}: {result['title']}")

if __name__ == '__main__':
    asyncio.run(main())

Monitoring & Debugging

Common Issues

Debug Helper

async def debug_pyppeteer():
    """Debug Pyppeteer setup"""
    # Check bundled Chromium
    from pyppeteer.chromium_downloader import check_chromium, chromium_executable
    print(f"Chromium path: {chromium_executable()} (downloaded: {check_chromium()})")

    # Check system Chrome
    scraper = PyppeteerScraper()
    print(f"System Chrome: {scraper.chrome_path}")

    # Test launch (init() stores the browser on the scraper, it doesn't return it)
    await scraper.init()
    print(f"Browser launched: {scraper.browser is not None}")
    await scraper.close()

Related Resources

⚠️ Maintenance Note

Pyppeteer hasn't been updated since 2020. For new projects, consider Playwright-python or Selenium. This article is for those maintaining existing Pyppeteer codebases.