Tweepy: Scraping Twitter After the API Price Hike
I needed to scrape 50K tweets for a research project. Then X/Twitter changed their API pricing in 2023 and the free tier became practically useless. Here's what actually works in 2026.
The API Pricing Problem
Before 2023, Tweepy with the free API tier could fetch thousands of tweets per month. After Elon's acquisition, everything changed:
- Free tier: 500 tweets per month (down from thousands)
- Basic tier: $100/month for 10K tweets
- Pro tier: $5,000/month for 1M tweets
- Enterprise: Contact sales (you can guess what that means)
For my 50K-tweet research project, the Basic tier would cost $500 (five months at 10K tweets each). Not happening.
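The arithmetic is simple enough to sanity-check with a few lines (a minimal sketch using the tier figures quoted above; `basic_tier_cost` is just an illustrative helper):

```python
import math

def basic_tier_cost(tweets_needed: int, tweets_per_month: int = 10_000,
                    price_per_month: int = 100) -> int:
    """Months are billed whole, so round the month count up."""
    months = math.ceil(tweets_needed / tweets_per_month)
    return months * price_per_month

print(basic_tier_cost(50_000))  # 5 months x $100/month
```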
Problem
I enabled Tweepy's automatic rate limit handling, but still kept getting 429 errors after about 200 requests. The API was cutting me off before the documented limits.
Error: 429 Too Many Requests - Rate limit exceeded
What I Tried
Attempt 1: Set wait_on_rate_limit=True - Still hit 429 errors
Attempt 2: Added manual delays between requests - Worked but incredibly slow (10 req/min)
Attempt 3: Used multiple bearer tokens rotating - Got banned after an hour
Actual Fix
The issue was that Twitter's v2 API has undocumented rate limits that differ from v1.1. The fix combines proper authentication with aggressive rate limiting.
import tweepy
import time

# Use OAuth 1.0a user context instead of an app-only bearer token;
# user context has higher rate limits on most endpoints
client = tweepy.Client(
    bearer_token="YOUR_BEARER_TOKEN",
    consumer_key="YOUR_CONSUMER_KEY",
    consumer_secret="YOUR_CONSUMER_SECRET",
    access_token="YOUR_ACCESS_TOKEN",
    access_token_secret="YOUR_ACCESS_TOKEN_SECRET",
    wait_on_rate_limit=True
)
# Additional safety margin - stay well below the documented limits
def safe_get_tweets(query, max_results=100):
    tweets = []
    next_token = None
    try:
        for _ in range(30):  # cap the number of requests per run
            response = client.search_recent_tweets(
                query=query,
                max_results=min(max_results, 100),
                next_token=next_token,
                tweet_fields=['created_at', 'public_metrics']
            )
            if response.data:
                tweets.extend(response.data)
            # Paginate; without next_token this would refetch the same page
            next_token = response.meta.get('next_token')
            if not next_token:
                break
            # Add a 2 second delay between requests
            time.sleep(2)
    except tweepy.errors.TooManyRequests:
        print("Hit rate limit, waiting 15 min...")
        time.sleep(900)
    return tweets
Problem
Some v2 endpoints returned 403 Forbidden even with valid authentication. Search worked, but user lookup and timeline fetching failed.
Error: 403 Forbidden - Endpoint not accessible with current access level
What I Tried
Attempt 1: Regenerated bearer token - Same 403 errors
Attempt 2: Switched to OAuth 1.0a user context - Some endpoints worked, others still 403
Attempt 3: Checked Twitter Developer Portal - My access level was "Free" (most restricted)
Actual Fix
The 403 errors were because the Free tier doesn't include access to most v2 endpoints. I had to use a hybrid approach: scrape publicly available data without authentication when possible.
# For public tweets, use the search endpoint (available on Free tier)
# For user data, you'll need a different approach
import requests
from bs4 import BeautifulSoup

def scrape_user_profile(username):
    """
    Fallback: scrape public profile data without the API.
    Only works for publicly available information.
    """
    url = f"https://twitter.com/{username}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Parse public data from HTML
        # Note: this breaks frequently as Twitter changes their DOM,
        # and much of the page is rendered client-side by JavaScript
        return {
            'username': username,
            'public_data': 'extract from HTML'
        }
    return None
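To wire the hybrid approach together, one option is a small dispatcher that tries the API first and falls back to scraping on a 403. This is a sketch: `api_fn` and `scrape_fn` are injected callables standing in for whatever API call and scraper you use, and `AccessDenied` is a placeholder for `tweepy.errors.Forbidden` so the logic stays testable without credentials:

```python
class AccessDenied(Exception):
    """Placeholder for tweepy.errors.Forbidden (HTTP 403)."""

def get_user_hybrid(username, api_fn, scrape_fn):
    """
    Try the v2 API first; if the endpoint isn't included in the
    current access level, fall back to the public scraper.
    """
    try:
        return api_fn(username)
    except AccessDenied:
        return scrape_fn(username)
```

In production you would catch `tweepy.errors.Forbidden` instead of the placeholder class.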
Problem
When using the filtered stream endpoint, the connection would drop after ~30 seconds with a "Stream ended" message. No errors, just silent disconnection.
What I Tried
Attempt 1: Added keep-alive ping - Didn't help
Attempt 2: Increased timeout values - Connection still dropped
Attempt 3: Tried different rules - Same issue
Actual Fix
Twitter's streaming API on the Free tier has severe restrictions. The solution is to use polling with exponential backoff instead of streaming.
import tweepy
import time

def poll_tweets(query, interval=30):
    """
    Polling alternative to the streaming API.
    More reliable on the Free tier, and plays well with rate limits.
    """
    client = tweepy.Client(bearer_token="YOUR_TOKEN")
    seen_tweets = set()
    backoff = interval
    while True:
        try:
            response = client.search_recent_tweets(
                query=query,
                max_results=100,
                tweet_fields=['created_at', 'author_id']
            )
            if response.data:
                for tweet in response.data:
                    if tweet.id not in seen_tweets:
                        seen_tweets.add(tweet.id)
                        yield tweet
            backoff = interval  # reset after a successful poll
            # Wait before the next poll
            time.sleep(interval)
        except tweepy.errors.TooManyRequests:
            # Exponential backoff, capped at 15 minutes
            backoff = min(backoff * 2, 900)
            print(f"Rate limited. Waiting {backoff}s...")
            time.sleep(backoff)
        except Exception as e:
            print(f"Error: {e}")
            time.sleep(60)
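Since poll_tweets is an infinite generator, the caller needs a way to stop it. A tiny `islice` wrapper does the job (the `take` helper is mine, not part of Tweepy):

```python
from itertools import islice

def take(gen, n):
    """Pull at most n items from a (possibly infinite) generator."""
    return list(islice(gen, n))

# Usage with the poller above:
# for tweet in take(poll_tweets("python lang:en"), 200):
#     print(tweet.id)
```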
What I Learned
- Lesson 1: The documented rate limits are maximums, not guarantees. Twitter enforces stricter limits dynamically.
- Lesson 2: OAuth 1.0a user context gets better limits than bearer token auth, but requires more setup.
- Lesson 3: Streaming API on Free tier is practically unusable. Polling with backoff is more reliable.
- Overall: For serious scraping in 2026, the official API is only viable with a paid tier. The Free tier requires combining API calls with other methods.
API-Free Alternatives
When the official API won't cut it, here are alternatives that still work in 2026:
1. Nitter Instances (Public Frontends)
Nitter is an open-source Twitter frontend. Public instances don't require authentication:
import requests
from bs4 import BeautifulSoup

def scrape_with_nitter(username):
    """
    Use public Nitter instances to avoid the API entirely.
    Note: instances go down frequently, so keep a fallback list.
    """
    instances = [
        "nitter.net",
        "nitter.poast.org",
        "nitter.privacydev.net"
    ]
    for instance in instances:
        try:
            url = f"https://{instance}/{username}"
            headers = {'User-Agent': 'Mozilla/5.0'}
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                # Parse tweets from HTML; extract_tweet_data is a helper
                # that pulls text/date out of each timeline item
                tweets = soup.find_all('div', class_='timeline-item')
                return [extract_tweet_data(t) for t in tweets]
        except requests.RequestException:
            continue
    return None
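Here is a minimal sketch of that `extract_tweet_data` helper, based on the class names Nitter's markup used at time of writing (`tweet-content`, `tweet-date`) — verify them against whichever instance you hit, since the markup can change:

```python
from bs4 import BeautifulSoup

def extract_tweet_data(item):
    """
    Pull text and timestamp out of one Nitter 'timeline-item' div.
    Class names assumed from Nitter's markup; confirm before relying on them.
    """
    content = item.find(class_='tweet-content')
    date_link = item.select_one('.tweet-date a')
    return {
        'text': content.get_text(strip=True) if content else None,
        'date': date_link.get('title') if date_link else None,
    }
```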
2. Browser Automation (Last Resort)
For limited data needs, undetected-chromedriver can work:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import time

def scrape_with_selenium(url):
    """
    Only use for small-scale scraping.
    Twitter detects automation quickly.
    """
    driver = uc.Chrome()
    driver.get(url)
    # Wait for manual login if needed
    time.sleep(10)
    # Scroll to load more tweets
    for _ in range(5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    # Extract data
    tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')
    # Parse and return...
    driver.quit()
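One gotcha with infinite scroll: each pass re-collects elements you already saw, so whatever parsing you bolt on should dedupe while preserving order. A small pure helper keeps that logic testable without a browser:

```python
def dedupe_preserve_order(items):
    """Drop duplicate scraped tweet texts, keeping the first occurrence."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
```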
Production Setup That Works
Here's my final setup that reliably fetches tweets without hitting API limits:
# twitter_scraper.py - Production configuration
import tweepy
import time
import logging
from typing import Generator, Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class TwitterScraper:
    def __init__(self, bearer_token: str, consumer_key: Optional[str] = None,
                 consumer_secret: Optional[str] = None,
                 access_token: Optional[str] = None,
                 access_token_secret: Optional[str] = None):
        """
        Initialize with OAuth 1.0a user context for better rate limits.
        Falls back to bearer-only if user context is not provided.
        """
        self.client = tweepy.Client(
            bearer_token=bearer_token,
            consumer_key=consumer_key,
            consumer_secret=consumer_secret,
            access_token=access_token,
            access_token_secret=access_token_secret,
            wait_on_rate_limit=True
        )

    def fetch_tweets(self, query: str, max_tweets: int = 1000) -> Generator:
        """
        Fetch tweets with built-in rate limit protection.

        Args:
            query: Search query
            max_tweets: Maximum tweets to fetch

        Yields:
            Tweet objects
        """
        collected = 0
        next_token = None
        while collected < max_tweets:
            try:
                response = self.client.search_recent_tweets(
                    query=query,
                    # The search endpoint rejects max_results below 10
                    max_results=min(100, max(10, max_tweets - collected)),
                    next_token=next_token,
                    tweet_fields=['created_at', 'public_metrics', 'author_id']
                )
                if not response.data:
                    logger.info("No more tweets available")
                    break
                for tweet in response.data:
                    yield tweet
                    collected += 1
                    if collected >= max_tweets:
                        return
                next_token = response.meta.get('next_token')
                if not next_token:
                    break
                # Conservative delay between requests
                time.sleep(3)
            except tweepy.errors.TooManyRequests:
                logger.warning("Rate limit hit, waiting 15 min")
                time.sleep(900)
            except Exception as e:
                logger.error(f"Error fetching tweets: {e}")
                time.sleep(60)

# Usage
if __name__ == "__main__":
    scraper = TwitterScraper(
        bearer_token="YOUR_TOKEN",
        consumer_key="YOUR_KEY",
        consumer_secret="YOUR_SECRET"
    )
    for tweet in scraper.fetch_tweets("python programming", max_tweets=500):
        print(f"{tweet.author_id}: {tweet.text[:100]}...")
Monitoring & Debugging
When scraping Twitter at scale, watch for these red flags:
Red Flags to Watch For
- 429 errors increasing: Your rate limit calculations are off
- Empty responses: Twitter may be silently rate-limiting you
- Authentication errors: Your token may have been revoked
- Sudden disconnections: IP-based throttling, switch proxies
Debugging Checklist
# Test authentication and check account info (-i shows response headers)
curl -i -X GET "https://api.twitter.com/2/users/me?user.fields=public_metrics" \
  -H "Authorization: Bearer $BEARER_TOKEN"

# Test search access
curl -i -X GET "https://api.twitter.com/2/tweets/search/recent?query=test" \
  -H "Authorization: Bearer $BEARER_TOKEN"

# Monitor the response headers for rate limit info
# Look for: x-rate-limit-remaining, x-rate-limit-reset
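The same header check works from Python. This helper just parses whatever headers mapping you hand it, using the documented `x-rate-limit-*` header names (the `requests` usage in the comment is illustrative):

```python
from datetime import datetime, timezone

def parse_rate_limit(headers):
    """
    Extract rate-limit info from v2 API response headers.
    Returns (remaining, reset_datetime) or (None, None) if absent.
    """
    remaining = headers.get('x-rate-limit-remaining')
    reset = headers.get('x-rate-limit-reset')  # unix epoch seconds
    if remaining is None or reset is None:
        return None, None
    return int(remaining), datetime.fromtimestamp(int(reset), tz=timezone.utc)

# e.g. with requests:
# r = requests.get(url, headers=auth_headers)
# remaining, reset_at = parse_rate_limit(r.headers)
```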
⚠️ Legal Note
Web scraping Twitter's data may violate their Terms of Service. This article is for educational purposes. Always check Twitter's current ToS and API terms before scraping. Consider using the official API with appropriate licensing for production use.