Ethical web scraper for collecting recipes from allrecipes.com.
- Sitemap-based discovery - Finds all ~48,000 recipe URLs automatically
- Sitemap caching - Sitemaps cached locally for faster subsequent runs
- Ethical rate limiting - 2-5 second random delays between requests
- Full resumability - Interrupt anytime with Ctrl+C, resume where you left off
- Rich data extraction - Title, ingredients, instructions, nutrition, ratings, and more
- JSONL output - One recipe per line, easy to process incrementally
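Resumability hinges on a small state file recording which URLs are already done. A minimal sketch of that idea, assuming a simple `{"done": [...]}` layout (the repo's actual `state.json` schema may differ):

```python
import json
from pathlib import Path

def load_done(state_path: str) -> set[str]:
    """Return the set of already-scraped URLs, or an empty set on first run."""
    p = Path(state_path)
    if p.exists():
        return set(json.loads(p.read_text())["done"])
    return set()

def save_done(state_path: str, done: set[str]) -> None:
    """Persist scraped URLs so an interrupted run can pick up where it left off."""
    p = Path(state_path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps({"done": sorted(done)}))
```

Calling `save_done` after each recipe means a Ctrl+C loses at most the in-flight request.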
Requires uv and Python 3.10+.
```shell
git clone https://github.com/YOUR_USERNAME/recipe-scraper.git
cd recipe-scraper
uv sync
```

```shell
# Test with a few recipes
uv run scrape_allrecipes.py --limit 10

# Full scrape (will take ~47 hours)
uv run scrape_allrecipes.py

# Resume an interrupted scrape
uv run scrape_allrecipes.py  # Automatically resumes from state.json

# Custom output location
uv run scrape_allrecipes.py --output my_recipes.jsonl --state my_state.json
```

| Option | Default | Description |
|---|---|---|
| `--limit N` | `0` (unlimited) | Maximum recipes to scrape |
| `--output FILE` | `data/recipes.jsonl` | Output file path |
| `--state FILE` | `data/state.json` | State file for resumability |
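The options above map naturally onto `argparse`. A rough sketch of how the flags might be declared — names and defaults come from the table, everything else is assumed:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI definition mirroring the options table."""
    p = argparse.ArgumentParser(description="Scrape recipes from allrecipes.com")
    p.add_argument("--limit", type=int, default=0,
                   help="maximum recipes to scrape (0 = unlimited)")
    p.add_argument("--output", default="data/recipes.jsonl",
                   help="output JSONL file path")
    p.add_argument("--state", default="data/state.json",
                   help="state file for resumability")
    return p
```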
Each line in recipes.jsonl is a JSON object:
```json
{
  "title": "Honey-Baked Spiral Ham",
  "ingredients": ["1 (8 pound) spiral-cut ham", "0.5 cup honey", ...],
  "instructions": "Spray the slow cooker...",
  "prep_time": 10,
  "cook_time": 375,
  "total_time": 385,
  "yields": "12 servings",
  "nutrients": {"calories": "437 kcal", "protein": "56 g", ...},
  "ratings": 4.8,
  "image": "https://...",
  "url": "https://www.allrecipes.com/recipe/...",
  "scraped_at": "2025-12-22T12:29:14+00:00"
}
```

| Recipes | Time | Storage |
|---|---|---|
| 1,000 | ~1 hour | ~3 MB |
| 10,000 | ~10 hours | ~30 MB |
| 48,000 (full) | ~47 hours | ~135 MB |
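Because the output is one JSON object per line, recipes can be streamed without loading the whole file into memory. A minimal reader sketch:

```python
import json
from typing import Iterator

def iter_recipes(path: str = "data/recipes.jsonl") -> Iterator[dict]:
    """Yield one recipe dict per JSONL line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `[r["title"] for r in iter_recipes() if r.get("ratings", 0) >= 4.5]` collects highly rated recipes in one pass.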
Sitemaps are cached in `data/sitemaps/` to speed up subsequent runs. To refresh the sitemap cache (e.g., to discover new recipes):

```shell
rm -rf data/sitemaps/
```

See ARCHITECTURE.md for design decisions and future improvement options (modular package, Scrapy migration, etc.).
This scraper is designed for responsible use:
- Conservative 2-5 second delays between requests
- Respects HTTP 429 rate limit responses
- Exponential backoff on errors (2s → 4s → 8s)
- Honest User-Agent with contact information
- Does not circumvent bot protection
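The backoff schedule above (2s → 4s → 8s) is plain exponential doubling. A sketch of the retry logic, where `fetch` stands in for the caller's request function (hypothetical helper names, not the repo's actual API):

```python
import time

def backoff_delays(retries: int = 3, base: float = 2.0) -> list[float]:
    """Delay before each retry: base * 2**attempt, i.e. 2s, 4s, 8s."""
    return [base * 2 ** attempt for attempt in range(retries)]

def fetch_with_retries(fetch, url: str, retries: int = 3, base: float = 2.0) -> bytes:
    """Call fetch(url); on failure, sleep the next backoff delay and retry."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(delays[attempt])
    raise AssertionError("unreachable")
```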
This is free and unencumbered software released into the public domain. See LICENSE.