Crawl4AI: Converting Websites to Markdown

Some notes from my experiments with Crawl4AI for scraping web content as markdown.

Background

I've been working on a project that needs to scrape web content and feed it into an LLM. I tried a bunch of options - BeautifulSoup, Scrapy - and looked at paid services like Firecrawl. Eventually I found Crawl4AI, and it's been working well for my use case.

The main thing I like is that it outputs clean markdown by default. With BeautifulSoup you get raw HTML and have to clean it up yourself. Crawl4AI handles navigation, ads, and other noise automatically.

It's built on top of Playwright so it can handle JavaScript-heavy sites. Downside is you need to install browser binaries which adds some overhead.

Installation

Pretty straightforward with pip. The thing that tripped me up initially was forgetting to install the browser binaries separately.

pip install crawl4ai
playwright install chromium

The browser download is around 150-200MB so it might take a bit depending on your connection. When I ran this on a server with slow internet, it timed out a few times.

After installation, you can verify it works:

python -c "import crawl4ai; print('OK')"

If you're on Windows and having issues, try temporarily disabling your antivirus or adding Python to the exclusions list. The Playwright download sometimes gets flagged.

Basic Usage

The async version is what you want to use. It's significantly faster than the sync version, especially when crawling multiple pages.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.example.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
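A nice side effect of the async API is that you can crawl several pages concurrently with asyncio.gather. Here's a minimal sketch - it uses a stub fetch so it runs anywhere, but in real use you'd swap in `await crawler.arun(url=url)` inside the crawler's context manager:

```python
import asyncio

# Hypothetical stand-in for crawler.arun so this sketch runs anywhere;
# in real use, replace with `await crawler.arun(url=url)`.
async def fetch_stub(url):
    await asyncio.sleep(0.01)  # simulate network latency
    return f"# Markdown for {url}"

async def crawl_many(urls):
    # gather runs the coroutines concurrently, so total wall time is
    # roughly one request's latency rather than one per URL
    return await asyncio.gather(*(fetch_stub(url) for url in urls))

results = asyncio.run(crawl_many(["https://example.com/a", "https://example.com/b"]))
print(results[0])  # → "# Markdown for https://example.com/a"
```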

The result object has a few useful properties:

result.markdown        # Clean markdown output
result.html            # Raw HTML
result.cleaned_html    # HTML with noise removed
result.links           # Extracted links (internal and external)
result.media           # Extracted images and other media
result.success         # Boolean: whether the crawl succeeded
result.status_code     # HTTP status code

I usually just use result.markdown since that's what gets fed into the LLM. The cleaned_html is sometimes useful for debugging.

Using with LLMs

Once you have the markdown, you can pass it directly to an LLM. Here's a simple example with OpenAI:

import asyncio
from crawl4ai import AsyncWebCrawler
from openai import OpenAI

client = OpenAI()

async def crawl_and_summarize(url):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=url)
        scraped_markdown = result.markdown

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Summarize the given content."},
                {"role": "user", "content": scraped_markdown[:8000]}  # Truncate if needed
            ]
        )

        return response.choices[0].message.content

summary = asyncio.run(crawl_and_summarize("https://example.com"))
print(summary)

For RAG applications, I've been using ChromaDB to store the embeddings:

import chromadb
from openai import OpenAI
from crawl4ai import AsyncWebCrawler

chroma_client = chromadb.Client()
openai_client = OpenAI()
collection = chroma_client.create_collection("docs")

async def crawl_and_index(url):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=url)
        chunks = [result.markdown[i:i+1000] for i in range(0, len(result.markdown), 1000)]

        for i, chunk in enumerate(chunks):
            embedding = openai_client.embeddings.create(
                model="text-embedding-3-small",
                input=chunk
            ).data[0].embedding

            collection.add(
                embeddings=[embedding],
                documents=[chunk],
                ids=[f"{url}_{i}"]
            )

One thing to watch out for is token limits. Some pages have a ton of content and you'll need to chunk them before sending to the LLM. I've been using 1000 character chunks which seems to work okay.
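For reference, here's a sketch of the kind of fixed-size chunker I mean. The overlap parameter isn't something I originally used - it's optional, but a small overlap helps preserve context across chunk boundaries for retrieval:

```python
def chunk_text(text, size=1000, overlap=100):
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so context isn't cut off mid-sentence."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 2500, size=1000, overlap=100)
print(len(chunks))     # → 3
print(len(chunks[0]))  # → 1000
```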

Some Advanced Stuff I Found Useful

CSS Selectors

If you only want a specific part of the page, use css_selector:

result = await crawler.arun(
    url="https://example.com",
    css_selector="article.main-content"
)

JavaScript Execution

Some sites load content dynamically. You can run custom JS:

result = await crawler.arun(
    url="https://example.com",
    js_code=[
        "document.querySelectorAll('.ad-banner').forEach(el => el.remove());",
        "window.scrollTo(0, document.body.scrollHeight);"
    ],
    wait_for="networkidle"
)

Rate Limiting

Got blocked a few times when crawling too fast. Added a delay between requests:

async def crawl_with_limit(urls, delay=2):
    async with AsyncWebCrawler(verbose=True) as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            yield result
            await asyncio.sleep(delay)

Issues I Ran Into

Empty Results

This happened when a site loaded its content via JavaScript after the initial response. Adding the wait_for parameter fixed it:

result = await crawler.arun(
    url="https://example.com",
    wait_for="networkidle"
)

403 Errors / Bot Detection

Some sites will block you. Things that helped:

  • Add delays between requests (1-2 seconds minimum)
  • Set a realistic user agent
  • Use headless=False for testing
  • For production, you'll probably need proxies
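On top of those, retrying with exponential backoff helps when blocks are intermittent. A sketch - BlockedError and flaky_fetch here are hypothetical stand-ins; in practice you'd wrap crawler.arun and raise when the result looks blocked (e.g. a 403 status):

```python
import asyncio
import random

class BlockedError(Exception):
    pass

async def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Retry a flaky fetch with exponential backoff plus jitter.
    `fetch` is any coroutine function taking a URL."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except BlockedError:
            if attempt == retries - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter so retries don't synchronize
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)

# Hypothetical fetch that gets blocked once, then succeeds
calls = {"n": 0}
async def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise BlockedError
    return "ok"

result = asyncio.run(fetch_with_retry(flaky_fetch, "https://example.com", base_delay=0.01))
print(result)  # → ok
```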

Memory Leaks

When crawling a lot of pages, the browser instance can eat up memory. Make sure to close it properly:

async with AsyncWebCrawler(verbose=True) as crawler:
    # do your crawling
    pass
# Browser closes automatically here

Slow Performance

Turn off verbose mode once you know it's working:

AsyncWebCrawler(verbose=False)

General Advice

  • Check robots.txt before crawling
  • Don't hammer servers with requests
  • Cache results when possible
  • Handle exceptions properly - network stuff fails more than you'd expect
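For the robots.txt point, the standard library already has a parser, so there's no excuse to skip it. A sketch - here the rules come from a string so it runs offline; in practice you'd fetch robots.txt from the site root first:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules with the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """\
User-agent: *
Disallow: /private/
"""
print(allowed(rules, "https://example.com/blog/post"))  # → True
print(allowed(rules, "https://example.com/private/x"))  # → False
```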

Random Thoughts

Overall, Crawl4AI has been working well for me. The markdown output is pretty clean and it handles JavaScript sites without much fuss.

If you're doing simple scraping and don't need JavaScript, BeautifulSoup is still faster and lighter. But for anything dynamic or when you need LLM-ready output, Crawl4AI saves a lot of time.

Haven't tried Firecrawl (the paid alternative) since Crawl4AI does what I need for free. Might be worth looking into if you're doing this at scale and need better reliability.

Link to the project: github.com/unclecode/crawl4ai