TextSpitter

Transforming documents into insights, effortlessly and efficiently.

Built with the tools and technologies:

TOML, Pytest, Python, GitHub Actions, uv


Overview

TextSpitter is a lightweight Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types — file paths, BytesIO streams, SpooledTemporaryFile objects, and raw bytes — into plain strings, making it ideal for pipelines that feed text into LLMs, search engines, or data-processing workflows.
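
The input normalisation described above can be sketched with the standard library alone. This is a minimal illustration, not TextSpitter's actual code: the helper name `read_all` is hypothetical, and it assumes every stream is seekable and opened in binary mode.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile
from typing import IO, Union

# Hypothetical alias for the input kinds the overview lists
Source = Union[str, Path, bytes, bytearray, IO[bytes]]


def read_all(source: Source) -> bytes:
    """Return the full contents of a path, raw bytes, or a binary stream."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    # Anything else is treated as a seekable binary stream:
    # rewind, read everything, then rewind again so the caller can reuse it.
    source.seek(0)
    data = source.read()
    source.seek(0)
    return data


if __name__ == "__main__":
    print(read_all(b"raw bytes"))
    print(read_all(BytesIO(b"in-memory stream")))
    with SpooledTemporaryFile(max_size=1024) as tmp:
        tmp.write(b"spooled upload")
        print(read_all(tmp))
```

Rewinding before and after the read means a partially consumed stream still yields its full contents and can be handed to another reader afterwards.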

Why TextSpitter?

  • 📄 Multi-format extraction — PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
  • 🔌 Stream-first API — accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes; no temp files required.
  • 🛠️ Optional structured logging — install textspitter[logging] to add loguru; falls back to stdlib logging transparently.
  • 🖥️ CLI included — uv tool install textspitter gives you a textspitter command for quick one-off extractions.
  • 🚀 Automated CI/CD — GitHub Actions run the test matrix (Python 3.12–3.14) and publish docs to GitHub Pages on every push.
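
The optional-logging bullet above describes a common pattern: try the extra dependency, fall back to the standard library. A minimal sketch of that pattern follows. It is an assumption about the general approach, not a copy of TextSpitter's logger.py; the logger name `textspitter_example` and the `USING_LOGURU` flag are hypothetical.

```python
import logging

try:
    # Optional dependency: importable only when loguru is installed
    from loguru import logger

    USING_LOGURU = True
except ImportError:
    # Fallback: a stdlib logger exposing the same info/warning/error methods
    logger = logging.getLogger("textspitter_example")
    USING_LOGURU = False

# Both backends accept a plain message string through the same method names
logger.warning("extraction fell back to the secondary PDF reader")
```

One caveat with this pattern: loguru formats arguments with `{}` placeholders while stdlib logging uses `%s`, so messages that are built up front (for example with f-strings) behave the same on both backends.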

Features

Component Details
⚙️ Architecture
  • Three-layer design: TextSpitter convenience function → WordLoader dispatcher → FileExtractor low-level reader
  • OOP design enables straightforward subclassing and extension
🔩 Code Quality
  • Strict PEP 8 / ruff linting with black formatting
  • Full type hints; ships a py.typed PEP 561 marker
📄 Documentation
  • API docs auto-published to GitHub Pages via pdoc
  • Quick-start guide, tutorial, use-case examples, and recipes
🔌 Integrations
  • CI/CD with GitHub Actions (tests + docs + PyPI publish)
  • Package management via uv; installable via pip or uv tool install
🧩 Modularity
  • Core FileExtractor separated from dispatch logic in WordLoader
  • Logging abstraction in logger.py isolates the optional loguru dependency
🧪 Testing
  • ~70 pytest tests covering all readers and input types
  • Dual-mode log capture fixture works with or without loguru
⚡️ Performance
  • Class-level frozenset / dict constants avoid per-call allocation
  • Stream rewind avoids re-reading large files
📦 Dependencies
  • Core: pymupdf, pypdf, python-docx
  • Optional logging: loguru (pip install textspitter[logging])
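
The three-layer design and the class-level constants listed above can be pictured with a toy version. Everything here is a stand-in written for illustration: `MiniExtractor`, `MiniLoader`, `mini_spitter`, and `read_text` are hypothetical names, and only plain-text decoding is implemented. Only the `file_load()` method name mirrors the documented WordLoader API.

```python
from pathlib import Path


class MiniExtractor:
    """Low-level reader: turns raw bytes into text (stand-in for FileExtractor)."""

    def __init__(self, data: bytes) -> None:
        self.data = data

    def read_text(self) -> str:
        return self.data.decode("utf-8", errors="replace")


class MiniLoader:
    """Dispatcher: picks a reader by file extension (stand-in for WordLoader)."""

    # Class-level constants are created once, when the class is defined,
    # rather than being rebuilt on every call.
    SUPPORTED: frozenset[str] = frozenset({".txt", ".csv", ".py"})
    READERS: dict[str, str] = {
        ".txt": "read_text",
        ".csv": "read_text",
        ".py": "read_text",
    }

    def __init__(self, filename: str, data: bytes) -> None:
        self.ext = Path(filename).suffix.lower()
        self.data = data

    def file_load(self) -> str:
        if self.ext not in self.SUPPORTED:
            raise ValueError(f"unsupported file type: {self.ext!r}")
        reader = getattr(MiniExtractor(self.data), self.READERS[self.ext])
        return reader()


def mini_spitter(filename: str, data: bytes) -> str:
    """Convenience layer: one call in, plain string out."""
    return MiniLoader(filename, data).file_load()


if __name__ == "__main__":
    print(mini_spitter("notes.txt", b"hello from the toy pipeline"))
```

Keeping the dispatcher separate from the reader means a new format only needs a new reader method and one more entry in the lookup table.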

Project Structure

TextSpitter/
├── .github/
│   └── workflows/
│       ├── docs.yml             # pdoc → GitHub Pages
│       ├── python-publish.yml   # PyPI release
│       └── tests.yml            # pytest matrix (3.12 – 3.14)
├── TextSpitter/
│   ├── __init__.py              # TextSpitter() + WordLoader public API
│   ├── cli.py                   # argparse CLI entry point
│   ├── core.py                  # FileExtractor class
│   ├── logger.py                # Optional loguru / stdlib fallback
│   ├── main.py                  # WordLoader dispatcher
│   ├── py.typed                 # PEP 561 marker
│   └── guide/                   # pdoc documentation pages (subpackage)
├── tests/
│   ├── conftest.py              # shared fixtures (log_capture)
│   ├── test_cli.py
│   ├── test_file_extractor.py
│   ├── test_txt.py
│   └── ...
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
└── uv.lock

Getting Started

Prerequisites

  • Python ≥ 3.12
  • uv (recommended) or pip

Installation

From PyPI:

pip install textspitter

# With optional loguru logging
pip install "textspitter[logging]"

Using uv:

uv add textspitter

# With optional loguru logging
uv add "textspitter[logging]"

As a standalone CLI tool:

uv tool install textspitter

From source:

git clone https://github.com/fsecada01/TextSpitter.git
cd TextSpitter
uv sync --all-extras --dev

Usage

As a library (one-liner):

from io import BytesIO

from TextSpitter import TextSpitter

# From a file path
text = TextSpitter(filename="report.pdf")
print(text)

# From a BytesIO stream (pdf_bytes holds the PDF's contents, already in memory)
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

# From raw bytes (docx_bytes holds the DOCX file's contents)
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")

Using the WordLoader class directly:

from TextSpitter.main import WordLoader

loader = WordLoader(filename="data.csv")
text = loader.file_load()

As a CLI tool:

# Extract a single file to stdout
textspitter report.pdf

# Extract multiple files and write to a combined output file
textspitter file1.pdf file2.docx notes.txt -o combined.txt
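
For readers curious how a command of this shape can be wired up, here is a minimal argparse sketch. It is not the project's cli.py: the program name `textspitter-sketch`, the functions `build_parser` and `run`, and the injected `extract` callable are all hypothetical, and only the positional files and the `-o` flag are taken from the examples above.

```python
import argparse
import sys
from pathlib import Path
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="textspitter-sketch")
    parser.add_argument("files", nargs="+", type=Path, help="files to extract")
    parser.add_argument("-o", dest="output", type=Path, default=None,
                        help="write combined text here instead of stdout")
    return parser


def run(argv: Sequence[str], extract: Callable[[Path], str]) -> int:
    """Extract every input file and emit the combined text."""
    args = build_parser().parse_args(argv)
    text = "\n".join(extract(path) for path in args.files)
    if args.output is None:
        sys.stdout.write(text + "\n")
    else:
        args.output.write_text(text, encoding="utf-8")
    return 0


if __name__ == "__main__":
    # Plain-text reading stands in for the real extraction call here
    sys.exit(run(sys.argv[1:], lambda p: p.read_text(encoding="utf-8")))
```

Passing the extraction function in as a parameter keeps the argument handling testable without any document-parsing dependencies.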

Testing

uv run pytest tests/

# With coverage
uv run pytest tests/ --cov=TextSpitter --cov-report=term-missing

Roadmap

v1.x (current)

  • Stream-based API (BytesIO, SpooledTemporaryFile, raw bytes)
  • CLI entry point (uv tool install textspitter)
  • Optional loguru logging with stdlib fallback
  • Programming-language file support (50+ extensions)
  • CI matrix (Python 3.12 – 3.14) + GitHub Pages docs
  • Async extraction API
  • CSV → structured output (list of dicts)
  • PPTX support
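
As an illustration of what "CSV → structured output (list of dicts)" could look like, the standard library already covers the core of it. This is a sketch of the idea only, not a TextSpitter API: the function name `csv_to_records` is hypothetical, and it assumes the first row is a header.

```python
import csv
from io import StringIO


def csv_to_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per data row, keyed by the header row."""
    # newline="" lets the csv module handle line endings itself
    return list(csv.DictReader(StringIO(text, newline="")))


if __name__ == "__main__":
    for record in csv_to_records("name,qty\nbolt,4\nnut,7\n"):
        print(record)
```

All values come back as strings; any type conversion would be a separate step.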

v2.0 — Rust backend (full roadmap)

  • Rust splitting core via PyO3 + Maturin — 10x–40x batch throughput
  • Graceful Python fallback when Rust extension is unavailable
  • manylinux wheels on PyPI — zero-compile install for Linux users
  • Memory-mapped file processing for very large PDFs (memmap2)
  • SIMD-accelerated string search for separator detection
  • Streaming iterator API (yield chunks instead of collecting all)
  • Optional SIMD feature flag (pip install "textspitter[simd]")
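
The streaming item above contrasts yielding chunks with collecting everything first. A pure-Python generator shows the shape of such an API. It is a sketch of the concept, not the planned Rust-backed implementation; the name `iter_chunks` and the default chunk size are assumptions.

```python
from collections.abc import Iterator
from io import BytesIO
from typing import IO


def iter_chunks(stream: IO[bytes], chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks from a binary stream until it is exhausted."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # read() returns b"" at end of stream, which ends the loop
    while chunk := stream.read(chunk_size):
        yield chunk


if __name__ == "__main__":
    for piece in iter_chunks(BytesIO(b"abcdefg"), chunk_size=3):
        print(piece)
```

Because the generator holds only one chunk at a time, memory use stays bounded by `chunk_size` regardless of the input size.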

Contributing

Contributing Guidelines
  1. Fork the Repository: Fork the project to your GitHub account.
  2. Clone Locally: Clone the forked repository.
    git clone https://github.com/<your-username>/TextSpitter.git
  3. Create a New Branch: Always work on a new branch.
    git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message.
    git commit -m 'Add new feature x.'
  6. Push to GitHub: Push the changes to your fork.
    git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against main. Describe the changes and motivation clearly.
  8. Review: Once approved, your PR will be merged. Thanks for contributing!

License

TextSpitter is released under the MIT License.