
Recipe Scraper

Ethical web scraper for collecting recipes from allrecipes.com.

Features

  • Sitemap-based discovery - Finds all ~48,000 recipe URLs automatically
  • Sitemap caching - Sitemaps cached locally for faster subsequent runs
  • Ethical rate limiting - 2-5 second random delays between requests
  • Full resumability - Interrupt anytime with Ctrl+C, resume where you left off
  • Rich data extraction - Title, ingredients, instructions, nutrition, ratings, and more
  • JSONL output - One recipe per line, easy to process incrementally
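The rate-limiting and resumability features above boil down to two small ideas: sleep a random 2-5 seconds between requests, and persist the set of completed URLs so an interrupted run can pick up where it stopped. A minimal sketch (the function names and state-file layout here are illustrative, not the scraper's actual internals):

```python
import json
import random
import time
from pathlib import Path


def load_state(state_file: Path) -> set[str]:
    """Return the set of URLs already scraped, if a state file exists."""
    if state_file.exists():
        return set(json.loads(state_file.read_text())["done"])
    return set()


def save_state(state_file: Path, done: set[str]) -> None:
    """Persist progress so an interrupted run can resume."""
    state_file.write_text(json.dumps({"done": sorted(done)}))


def polite_delay(min_s: float = 2.0, max_s: float = 5.0) -> float:
    """Sleep a random 2-5 seconds between requests; returns the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

A scrape loop would then skip any URL in `load_state(...)`, call `polite_delay()` before each request, and `save_state(...)` after each success, which is why Ctrl+C at any point loses at most the in-flight recipe.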

Installation

Requires uv and Python 3.10+.

git clone https://github.com/YOUR_USERNAME/recipe-scraper.git
cd recipe-scraper
uv sync

Usage

# Test with a few recipes
uv run scrape_allrecipes.py --limit 10

# Full scrape (will take ~47 hours)
uv run scrape_allrecipes.py

# Resume an interrupted scrape
uv run scrape_allrecipes.py  # Automatically resumes from state.json

# Custom output location
uv run scrape_allrecipes.py --output my_recipes.jsonl --state my_state.json

CLI Options

Option         Default             Description
--limit N      0 (unlimited)       Maximum recipes to scrape
--output FILE  data/recipes.jsonl  Output file path
--state FILE   data/state.json     State file for resumability
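With Python's standard `argparse`, a parser exposing these three options could look like this (a sketch mirroring the documented defaults, not necessarily the script's actual argument handling):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser matching the options documented above."""
    p = argparse.ArgumentParser(description="Scrape recipes from allrecipes.com")
    p.add_argument("--limit", type=int, default=0,
                   help="Maximum recipes to scrape (0 = unlimited)")
    p.add_argument("--output", default="data/recipes.jsonl",
                   help="Output JSONL file path")
    p.add_argument("--state", default="data/state.json",
                   help="State file for resumability")
    return p
```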

Output Format

Each line in recipes.jsonl is a JSON object:

{
  "title": "Honey-Baked Spiral Ham",
  "ingredients": ["1 (8 pound) spiral-cut ham", "0.5 cup honey", ...],
  "instructions": "Spray the slow cooker...",
  "prep_time": 10,
  "cook_time": 375,
  "total_time": 385,
  "yields": "12 servings",
  "nutrients": {"calories": "437 kcal", "protein": "56 g", ...},
  "ratings": 4.8,
  "image": "https://...",
  "url": "https://www.allrecipes.com/recipe/...",
  "scraped_at": "2025-12-22T12:29:14+00:00"
}
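Because each line is a self-contained JSON object, the output can be processed one record at a time without loading the whole file into memory. For example (a small reader sketch; the field names match the format above):

```python
import json
from typing import Iterator


def iter_recipes(path: str) -> Iterator[dict]:
    """Yield one recipe dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Typical use: `[r["title"] for r in iter_recipes("data/recipes.jsonl") if r.get("ratings", 0) >= 4.5]` collects highly rated recipes in a single streaming pass.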

Estimated Times & Storage

Recipes        Time       Storage
1,000          ~1 hour    ~3 MB
10,000         ~10 hours  ~30 MB
48,000 (full)  ~47 hours  ~135 MB
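These time estimates follow directly from the rate limiting: at a 2-5 second random delay, each request costs about 3.5 seconds on average, and the delay dominates the total runtime:

```python
AVG_DELAY_S = 3.5  # midpoint of the 2-5 second random delay


def estimated_hours(n_recipes: int) -> float:
    """Rough scrape duration from the delay alone (ignores download/parse time)."""
    return n_recipes * AVG_DELAY_S / 3600
```

For the full set, 48,000 × 3.5 s ≈ 168,000 s ≈ 47 hours, matching the table.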

Cache Management

Sitemaps are cached in data/sitemaps/ to speed up subsequent runs. To refresh the sitemap cache (e.g., to discover new recipes):

rm -rf data/sitemaps/
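The caching behaviour amounts to a read-through check on that directory: use the cached file if it exists, otherwise download and cache it. A sketch (the `fetch` callable stands in for the real HTTP download; the default directory matches the one above):

```python
from pathlib import Path
from typing import Callable


def cached_sitemap(name: str, fetch: Callable[[str], str],
                   cache_dir: Path = Path("data/sitemaps")) -> str:
    """Return a sitemap from the local cache, fetching and caching on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / name
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    content = fetch(name)  # placeholder for the real HTTP request
    cache_file.write_text(content, encoding="utf-8")
    return content
```

Deleting the directory simply forces every sitemap back through the `fetch` branch, which is why `rm -rf data/sitemaps/` is all it takes to refresh.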

Architecture

See ARCHITECTURE.md for design decisions and future improvement options (modular package, Scrapy migration, etc.).

Ethical Scraping

This scraper is designed for responsible use:

  • Conservative 2-5 second delays between requests
  • Respects HTTP 429 rate limit responses
  • Exponential backoff on errors (2s → 4s → 8s)
  • Honest User-Agent with contact information
  • Does not circumvent bot protection
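The backoff schedule listed above (2s → 4s → 8s) is a standard doubling retry. A sketch of the pattern, with an illustrative `fetch` callable and retry count:

```python
import time
from typing import Callable


def fetch_with_backoff(fetch: Callable[[str], str], url: str,
                       retries: int = 3, base_delay: float = 2.0) -> str:
    """Retry a failing fetch with exponentially doubling delays: 2s, 4s, 8s."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; let the caller decide what to do
            time.sleep(base_delay * 2 ** attempt)
```

An HTTP 429 response would be handled the same way: treat it as a retryable failure and back off rather than hammering the server.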

License

This is free and unencumbered software released into the public domain. See LICENSE.
