Facebook Page Scraper: No API or Login

Pull competitor data from Facebook pages without risking account bans.

Needed to check what competitors were doing on Facebook. Their posts, customer comments, common questions. Facebook's API requires approval and has rate limits. Didn't feel like dealing with any of that.

So I wrote a scraper that just loads public pages. No login, no API key. Been using it for a few months now.

Facebook pages are public. The issue is the JavaScript rendering and bot detection. Regular Selenium gets blocked immediately. What works is undetected-chromedriver - a drop-in replacement that patches ChromeDriver to remove the fingerprints Facebook's detection looks for.

Install this stuff:

pip install selenium undetected-chromedriver beautifulsoup4 pandas

You need Chrome installed. The driver downloads automatically.

Watch out for Chrome version mismatches. If you update Chrome the moment a new version drops, undetected-chromedriver might not support it yet. I stay in the Chrome 120-125 range.

Here's the scraper:

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd
import re

def scrape_facebook_page(page_url, max_scrolls=10):
    options = uc.ChromeOptions()

    # Uncomment for production
    # options.add_argument('--headless')

    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36')

    driver = uc.Chrome(options=options)

    try:
        driver.get(page_url)

        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '[role="feed"]'))
        )

        posts_data = []

        for i in range(max_scrolls):
            soup = BeautifulSoup(driver.page_source, 'html.parser')

            # These selectors change. Check current page structure.
            posts = soup.find_all('div', {'data-pagelet': True})

            for post in posts:
                try:
                    text_elem = post.find('div', {'data-ad-preview': 'message'})
                    text = text_elem.get_text(strip=True) if text_elem else ''

                    reactions = post.find('span', {'class': re.compile(r'.*like.*', re.I)})
                    reactions_count = reactions.get_text(strip=True) if reactions else '0'

                    link_elem = post.find('a', href=re.compile(r'/posts/'))
                    link = 'https://facebook.com' + link_elem['href'] if link_elem else page_url

                    if text and text not in [p['text'] for p in posts_data]:
                        posts_data.append({
                            'text': text,
                            'reactions': reactions_count,
                            'link': link
                        })
                except Exception:
                    continue

            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            time.sleep(2)

        return pd.DataFrame(posts_data)

    finally:
        driver.quit()

I do 5-10 scrolls normally. Gets a few dozen posts. More than that and you start getting duplicates as the feed refreshes.
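One thing the scraper leaves alone: reaction counts come back as display strings like "1.2K" or "3M". If you want to sort or chart them, you have to normalize. This is a small helper I'd use for that - the K/M suffix format is an assumption about how Facebook abbreviates, so adjust if you see other suffixes:

```python
import re

def parse_count(raw):
    """Convert a display count like '1.2K' or '3M' to an int.
    Assumes Facebook's K/M abbreviation style; returns 0 on anything else."""
    raw = raw.strip().upper().replace(',', '')
    match = re.match(r'^([\d.]+)([KM]?)$', raw)
    if not match:
        return 0
    value, suffix = float(match.group(1)), match.group(2)
    multiplier = {'': 1, 'K': 1_000, 'M': 1_000_000}[suffix]
    return int(value * multiplier)
```

Then something like `df['reactions_num'] = df['reactions'].apply(parse_count)` gives you a sortable column.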

The comments are more interesting. People ask about shipping, pricing, problems with orders. Here's how I pull those:

def get_post_comments(post_url):
    driver = uc.Chrome()
    comments_data = []

    try:
        driver.get(post_url)
        time.sleep(3)

        # Expand comments
        for _ in range(3):
            try:
                view_more = driver.find_element(By.XPATH, '//div[contains(text(), "View more comments")]')
                view_more.click()
                time.sleep(2)
            except Exception:
                break

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        comment_blocks = soup.find_all('div', {'aria-label': re.compile(r'Comment', re.I)})

        for block in comment_blocks:
            try:
                author = block.find('span', {'class': re.compile(r'.*name.*', re.I)})
                author_name = author.get_text(strip=True) if author else 'Unknown'

                text = block.find('div', {'data-testid': 'comment_message'})
                comment_text = text.get_text(strip=True) if text else ''

                # Pull out contact info if people leave it
                email_match = re.search(r'[\w\.-]+@[\w\.-]+\.\w+', comment_text)
                phone_match = re.search(r'[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4,6}', comment_text)

                if comment_text:
                    comments_data.append({
                        'author': author_name,
                        'comment': comment_text,
                        'email': email_match.group() if email_match else '',
                        'phone': phone_match.group() if phone_match else ''
                    })
            except Exception:
                continue

        return pd.DataFrame(comments_data)

    finally:
        driver.quit()

If I see a bunch of comments asking about international shipping, I know that's a weak spot for that competitor. Useful stuff.
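To spot those patterns without reading every comment, I'd just count keyword hits across the comment text. A minimal sketch - the keyword list here is made up, swap in whatever matters for your niche:

```python
from collections import Counter

def keyword_frequency(comments, keywords):
    """Count how many comments mention each keyword (case-insensitive).
    Feed it the 'comment' column from get_post_comments()."""
    counts = Counter()
    for comment in comments:
        lowered = comment.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw] += 1
    return counts

# freq = keyword_frequency(df['comment'].tolist(), ['shipping', 'refund', 'price'])
```

If 'shipping' dominates the counts for one competitor, that's your opening.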

The about page usually has email/phone:

def get_page_contact_info(page_url):
    about_url = page_url.rstrip('/') + '/about'
    driver = uc.Chrome()

    try:
        driver.get(about_url)
        time.sleep(5)

        soup = BeautifulSoup(driver.page_source, 'html.parser')

        contact_info = {'email': '', 'phone': '', 'website': '', 'address': ''}

        page_text = soup.get_text()
        email_match = re.search(r'[\w\.-]+@[\w\.-]+\.\w+', page_text)
        if email_match:
            contact_info['email'] = email_match.group()

        phone_match = re.search(r'[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4,6}', page_text)
        if phone_match:
            contact_info['phone'] = phone_match.group()

        for link in soup.find_all('a', href=True):
            if 'http' in link['href'] and 'facebook' not in link['href']:
                contact_info['website'] = link['href']
                break

        return contact_info

    finally:
        driver.quit()
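Quick sanity check on the two regexes above, since they're doing the real work in both the comment and about-page functions. The sample text is made up:

```python
import re

# Same patterns as in get_post_comments() and get_page_contact_info()
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')
PHONE_RE = re.compile(r'[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4,6}')

sample = "Reach us at support@example.com or (555) 123-4567."
email = EMAIL_RE.search(sample)   # matches support@example.com
phone = PHONE_RE.search(sample)   # matches (555) 123-4567
```

The phone pattern is loose and tuned for US-style numbers; international formats will slip through.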

Run through a list of competitors with delays:

import random

competitor_pages = [
    'https://www.facebook.com/competitor1',
    'https://www.facebook.com/competitor2',
    'https://www.facebook.com/competitor3',
]

all_data = []

for page in competitor_pages:
    try:
        posts = scrape_facebook_page(page, max_scrolls=5)
        contact = get_page_contact_info(page)

        posts['page_url'] = page
        posts['contact_email'] = contact['email']
        posts['contact_phone'] = contact['phone']

        all_data.append(posts)

        time.sleep(random.uniform(5, 15))

    except Exception as e:
        print(f"Failed on {page}: {e}")
        continue

if all_data:
    final_df = pd.concat(all_data, ignore_index=True)
    final_df.to_csv('competitor_analysis.csv', index=False)

5-15 seconds between requests works for me. Any less and you'll get temporarily blocked.
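I'd wrap that delay in a small helper so you can stretch it after trouble. `polite_sleep` is my own name for this, not a library function:

```python
import random
import time

def polite_sleep(base_min=5.0, base_max=15.0, backoff=1.0):
    """Sleep a random interval in [base_min, base_max] seconds,
    stretched by a multiplier. Bump backoff to 2-3 after a captcha
    or an empty page, reset it once things look normal again."""
    delay = random.uniform(base_min, base_max) * backoff
    time.sleep(delay)
    return delay
```

Drop it in place of the bare `time.sleep(random.uniform(5, 15))` in the loop above.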

Some issues I hit:

Getting blocked - If you see a captcha, slow down. Try a different user agent or run at off-hours. Facebook's detection seems to vary by time of day.

Selectors breaking - Facebook changes class names constantly. What worked last week might break today. Use role attributes and data attributes when possible.

Empty results - Usually means the page is private, JS didn't finish loading, or content is geo-blocked.

Memory - Chrome leaks memory on long runs. Restart the browser every 20-30 pages.
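The restart part is just splitting the page list into batches and giving each batch a fresh driver. A sketch of the batching, with the per-batch scraping left as a comment:

```python
def chunked(items, size):
    """Yield successive chunks of `size` items from a list,
    so each chunk can get its own fresh browser instance."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# for batch in chunked(competitor_pages, 25):
#     driver = uc.Chrome()          # fresh browser per batch
#     ... scrape every page in batch ...
#     driver.quit()                 # releases the leaked memory
```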

This works for light scraping. If you need thousands of pages regularly, look into proxies and rotating user agents.

I save everything to a database and run weekly scans. Helps track how competitor content changes over time.
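For the database part, stdlib sqlite3 is enough - each row gets a scrape timestamp so weekly runs can be diffed. The schema below is just how I'd lay it out, not anything canonical:

```python
import sqlite3
import time

def save_posts(db_path, rows):
    """Append scraped post dicts to SQLite with a scraped_at timestamp.
    Expects rows shaped like the scraper output: text, reactions, link, page_url."""
    conn = sqlite3.connect(db_path)
    conn.execute('''CREATE TABLE IF NOT EXISTS posts (
        text TEXT, reactions TEXT, link TEXT, page_url TEXT, scraped_at REAL)''')
    now = time.time()
    conn.executemany(
        'INSERT INTO posts VALUES (?, ?, ?, ?, ?)',
        [(r['text'], r['reactions'], r['link'], r['page_url'], now) for r in rows])
    conn.commit()
    conn.close()
```

With pandas in the mix you could call `save_posts('competitors.db', final_df.to_dict('records'))` after each weekly run, then compare `scraped_at` snapshots.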

Haven't figured out geo-location handling yet - Facebook shows different content based on IP. Let me know if you have a cheap solution for this.

Legal note: only scrape public pages. Scraping private profiles and groups is against Facebook's terms.