
Recipe Scraper

Ethical web scraper for collecting recipes from allrecipes.com.

Features

  • Sitemap-based discovery - Finds all ~48,000 recipe URLs automatically
  • Sitemap caching - Sitemaps cached locally for faster subsequent runs
  • Ethical rate limiting - 2-5 second random delays between requests
  • Full resumability - Interrupt anytime with Ctrl+C, resume where you left off
  • Rich data extraction - Title, ingredients, instructions, nutrition, ratings, and more
  • JSONL output - One recipe per line, easy to process incrementally
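The rate-limiting and resumability features above boil down to two small ideas: sleep a random 2-5 seconds between requests, and persist the set of completed URLs so an interrupted run can pick up where it stopped. A minimal sketch (the function names and state-file layout here are illustrative, not the scraper's actual internals):

```python
import json
import random
import time
from pathlib import Path


def load_state(state_file: Path) -> set[str]:
    """Return the set of URLs already scraped, if a state file exists."""
    if state_file.exists():
        return set(json.loads(state_file.read_text())["done"])
    return set()


def save_state(state_file: Path, done: set[str]) -> None:
    """Persist progress so an interrupted run can resume."""
    state_file.write_text(json.dumps({"done": sorted(done)}))


def polite_delay(min_s: float = 2.0, max_s: float = 5.0) -> float:
    """Sleep a random 2-5 seconds between requests; returns the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

A scrape loop would then skip any URL in `load_state(...)`, call `polite_delay()` before each request, and `save_state(...)` after each success, which is why Ctrl+C at any point loses at most the in-flight recipe.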

Installation

Requires uv and Python 3.10+.

git clone https://github.com/YOUR_USERNAME/recipe-scraper.git
cd recipe-scraper
uv sync

Usage

# Test with a few recipes
uv run scrape_allrecipes.py --limit 10

# Full scrape (will take ~47 hours)
uv run scrape_allrecipes.py

# Resume an interrupted scrape
uv run scrape_allrecipes.py  # Automatically resumes from state.json

# Custom output location
uv run scrape_allrecipes.py --output my_recipes.jsonl --state my_state.json

CLI Options

Option         Default             Description
--limit N      0 (unlimited)       Maximum recipes to scrape
--output FILE  data/recipes.jsonl  Output file path
--state FILE   data/state.json     State file for resumability
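With Python's standard `argparse`, a parser exposing these three options could look like this (a sketch mirroring the documented defaults, not necessarily the script's actual argument handling):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser matching the options documented above."""
    p = argparse.ArgumentParser(description="Scrape recipes from allrecipes.com")
    p.add_argument("--limit", type=int, default=0,
                   help="Maximum recipes to scrape (0 = unlimited)")
    p.add_argument("--output", default="data/recipes.jsonl",
                   help="Output JSONL file path")
    p.add_argument("--state", default="data/state.json",
                   help="State file for resumability")
    return p
```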

Output Format

Each line in recipes.jsonl is a JSON object:

{
  "title": "Honey-Baked Spiral Ham",
  "ingredients": ["1 (8 pound) spiral-cut ham", "0.5 cup honey", ...],
  "instructions": "Spray the slow cooker...",
  "prep_time": 10,
  "cook_time": 375,
  "total_time": 385,
  "yields": "12 servings",
  "nutrients": {"calories": "437 kcal", "protein": "56 g", ...},
  "ratings": 4.8,
  "image": "https://...",
  "url": "https://www.allrecipes.com/recipe/...",
  "scraped_at": "2025-12-22T12:29:14+00:00"
}
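Because each line is a self-contained JSON object, the output can be processed one record at a time without loading the whole file into memory. For example (a small reader sketch; the field names match the format above):

```python
import json
from typing import Iterator


def iter_recipes(path: str) -> Iterator[dict]:
    """Yield one recipe dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Typical use: `[r["title"] for r in iter_recipes("data/recipes.jsonl") if r.get("ratings", 0) >= 4.5]` collects highly rated recipes in a single streaming pass.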

Estimated Times & Storage

Recipes        Time       Storage
1,000          ~1 hour    ~3 MB
10,000         ~10 hours  ~30 MB
48,000 (full)  ~47 hours  ~135 MB
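These time estimates follow directly from the rate limiting: at a 2-5 second random delay, each request costs about 3.5 seconds on average, and the delay dominates the total runtime:

```python
AVG_DELAY_S = 3.5  # midpoint of the 2-5 second random delay


def estimated_hours(n_recipes: int) -> float:
    """Rough scrape duration from the delay alone (ignores download/parse time)."""
    return n_recipes * AVG_DELAY_S / 3600
```

For the full set, 48,000 × 3.5 s ≈ 168,000 s ≈ 47 hours, matching the table.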

Cache Management

Sitemaps are cached in data/sitemaps/ to speed up subsequent runs. To refresh the sitemap cache (e.g., to discover new recipes):

rm -rf data/sitemaps/
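The caching behaviour amounts to a read-through check on that directory: use the cached file if it exists, otherwise download and cache it. A sketch (the `fetch` callable stands in for the real HTTP download; the default directory matches the one above):

```python
from pathlib import Path
from typing import Callable


def cached_sitemap(name: str, fetch: Callable[[str], str],
                   cache_dir: Path = Path("data/sitemaps")) -> str:
    """Return a sitemap from the local cache, fetching and caching on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / name
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    content = fetch(name)  # placeholder for the real HTTP request
    cache_file.write_text(content, encoding="utf-8")
    return content
```

Deleting the directory simply forces every sitemap back through the `fetch` branch, which is why `rm -rf data/sitemaps/` is all it takes to refresh.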

Architecture

See ARCHITECTURE.md for design decisions and future improvement options (modular package, Scrapy migration, etc.).

Ethical Scraping

This scraper is designed for responsible use:

  • Conservative 2-5 second delays between requests
  • Respects HTTP 429 rate limit responses
  • Exponential backoff on errors (2s → 4s → 8s)
  • Honest User-Agent with contact information
  • Does not circumvent bot protection
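The backoff schedule listed above (2s → 4s → 8s) is a standard doubling retry. A sketch of the pattern, with an illustrative `fetch` callable and retry count:

```python
import time
from typing import Callable


def fetch_with_backoff(fetch: Callable[[str], str], url: str,
                       retries: int = 3, base_delay: float = 2.0) -> str:
    """Retry a failing fetch with exponentially doubling delays: 2s, 4s, 8s."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; let the caller decide what to do
            time.sleep(base_delay * 2 ** attempt)
```

An HTTP 429 response would be handled the same way: treat it as a retryable failure and back off rather than hammering the server.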

License

This is free and unencumbered software released into the public domain. See LICENSE.
