Skip to content

lisunshiny/herbarium-processor

Repository files navigation

🌿 Parsely

Herbarium Specimen Digitization Platform

Parsely Core + Studio help herbaria, museums, and researchers digitize large volumes of specimen labels with the latest AI models in a clean, intuitive workflow.

🔗 Live demo (pre-alpha): parselystudio.com

⚠️ This demo is a pre-alpha release. Features are incomplete and downtime is expected. For stable local use, see the Getting Started section below.

✨ What it does

Given a set of specimen label images, Parsely can:

  • Preprocess images — crop, deskew, and auto-rotate to prepare for AI extraction.
  • Run OCR — call Google Vision OCR (or other engines) to extract text from images.
  • Extract structured data — send OCR + images to an LLM via OpenRouter (currently Gemini 2.5 Pro) to parse into specimen fields (e.g., catalog number, taxon, collector) according to Darwin Core schema.
  • Edit + review — provide a simple web UI for curators to view images, edit predictions, and export results to CSV.

🚀 Getting Started

1. Prerequisites

Make sure the following are installed on your system:

  • Python 3.11

  • Node.js 20 and npm

  • System packages required by OpenCV and HEIF support. On Debian/Ubuntu:

    sudo apt-get install -y libgl1 libglib2.0-0 libheif1 libde265-0

If you don’t have Poetry yet:

pip3 install poetry

2. Clone and install

git clone https://github.com/<your-user>/herbarium-processor.git
cd herbarium-processor
poetry install
cd src/herbarium_processor/web/frontend
npm ci
cd -

3. Configure environment

Create a .env file in the project root with:

OPENROUTER_API_KEY=your_openrouter_key_here
GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"

Optional: install pre-commit hooks (we use this to strip notebook metadata):

poetry run pre-commit install

🖥️ Usage

Option A: Web App

  1. Start the server:
    poetry run dev
  2. Open the frontend at http://localhost:5173/
  3. The API server runs at http://localhost:8000/
  4. Upload images → edit predictions → finalize CSV.
  5. Processed files are stored in /tmp.

Option B: Notebook

  1. Open notebooks/herbarium_processor.ipynb.
  2. Point it to a directory of images (img/bucket).

📜 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE for details. See NOTICE and COPYRIGHT for attribution and trademark information.

About

Multimodal AI-powered herbarium specimen digitizer using Gemini 2.5 Pro

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors