Herbarium Specimen Digitization Platform
Parsely Core + Studio help herbaria, museums, and researchers digitize large volumes of specimen labels with the latest AI models in a clean, intuitive workflow.
🔗 Live demo (pre-alpha): parselystudio.com
⚠️ This demo is a pre-alpha release. Features are incomplete and downtime is expected. For stable local use, see the Getting Started section below.
Given a set of specimen label images, Parsely can:
- Preprocess images — crop, deskew, and auto-rotate to prepare for AI extraction.
- Run OCR — call Google Vision OCR (or other engines) to extract text from images.
- Extract structured data — send OCR + images to an LLM via OpenRouter (currently Gemini 2.5 Pro) to parse into specimen fields (e.g., catalog number, taxon, collector) according to Darwin Core schema.
- Edit + review — provide a simple web UI for curators to view images, edit predictions, and export results to CSV.
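The final export step can be pictured with a small sketch. This is not Parsely's actual exporter: the function name `records_to_csv`, the choice of columns, and the sample record are all assumptions made for illustration. Only the column names themselves are real Darwin Core terms.

```python
import csv
import io

# Real Darwin Core term names; which terms Parsely actually exports is an assumption.
DWC_FIELDS = ["catalogNumber", "scientificName", "recordedBy", "eventDate", "country"]


def records_to_csv(records):
    """Serialize a list of dicts to CSV text, one column per Darwin Core term."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=DWC_FIELDS,
        restval="",              # fields the model did not predict stay empty
        extrasaction="ignore",   # keys outside DWC_FIELDS are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up record, for illustration only.
example = [{"catalogNumber": "H-0001", "scientificName": "Quercus alba L.", "recordedBy": "J. Doe"}]
print(records_to_csv(example))
```

Leaving unpredicted fields empty rather than guessing keeps the CSV honest about what still needs curator review.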
Make sure the following are installed on your system:
- Python 3.11
- Node.js 20 and npm
- System packages required by OpenCV and HEIF support. On Debian/Ubuntu:
sudo apt-get install -y libgl1 libglib2.0-0 libheif1 libde265-0
If you don’t have Poetry yet:
pip3 install poetry
Clone the repository and install the Python and frontend dependencies:
git clone https://github.com/<your-user>/herbarium-processor.git
cd herbarium-processor
poetry install
cd src/herbarium_processor/web/frontend
npm ci
cd -
Create a .env file in the project root with:
OPENROUTER_API_KEY=your_openrouter_key_here
GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"
Optional: install pre-commit hooks (we use this to strip notebook metadata):
poetry run pre-commit install
- Start the server:
poetry run dev
- Open the frontend at http://localhost:5173/
- The API server runs at http://localhost:8000/
- Upload images → edit predictions → finalize CSV.
- Processed files are stored in /tmp.
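The OCR and extraction steps presumably depend on the two variables set in `.env`, so it can be worth confirming that the file defines both before uploading images. The following is a minimal sketch of such a check. The helper names `parse_env` and `missing_keys` are hypothetical and not part of the project, and the parser only handles simple `KEY=value` lines.

```python
from pathlib import Path

# The two variable names come from the .env example in this README.
REQUIRED_KEYS = ("OPENROUTER_API_KEY", "GOOGLE_APPLICATION_CREDENTIALS")


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text):
    """Return the required keys that are absent or empty in the given .env text."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    print("Missing:", missing_keys(text) or "none")
```

Run it from the project root. It checks only that the keys are present, not that the credentials are valid.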
- Open notebooks/herbarium_processor.ipynb.
- Point it to a directory of images (img/bucket).
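Before running the notebook on a directory, a quick listing can confirm that it contains the images you expect. This sketch is independent of the notebook's own loading code. The function `list_images` and the extension list are assumptions; the extensions were chosen because the setup instructions mention HEIF support.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your collection.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".heic"}


def list_images(directory):
    """Return image files directly inside `directory`, sorted by path."""
    root = Path(directory)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


if __name__ == "__main__":
    bucket = Path("img/bucket")
    if bucket.is_dir():
        images = list_images(bucket)
        print(f"Found {len(images)} image(s) in {bucket}")
    else:
        print(f"{bucket} does not exist")
```

The suffix comparison is case-insensitive, so files such as `IMG_0001.JPG` are included.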
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE for details. See NOTICE and COPYRIGHT for attribution and trademark information.