TextSpitter

Transforming documents into insights, effortlessly and efficiently.

Overview

TextSpitter is a lightweight Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types — file paths, BytesIO streams, SpooledTemporaryFile objects, and raw bytes — into plain strings, making it ideal for pipelines that feed text into LLMs, search engines, or data-processing workflows.

Why TextSpitter?

📄 Multi-format extraction — PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
🔌 Stream-first API — accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes; no temp files required.
🛠️ Optional structured logging — install textspitter[logging] to add loguru; falls back to stdlib logging transparently.
🖥️ CLI included — uv tool install textspitter gives you a textspitter command for quick one-off extractions.
🚀 Automated CI/CD — GitHub Actions run the test matrix (Python 3.12–3.14) and publish docs to GitHub Pages on every push.
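
The optional-logging behaviour described above can be illustrated with a common import-fallback pattern. This is only a minimal sketch of the general idea, not the contents of TextSpitter's `logger.py`; the logger name `textspitter_demo` and the `USING_LOGURU` flag are hypothetical.

```python
import logging

try:
    # Prefer loguru when the optional extra is installed.
    from loguru import logger

    USING_LOGURU = True
except ImportError:
    # Otherwise fall back to the standard library logger.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("textspitter_demo")  # hypothetical name
    USING_LOGURU = False

# Both backends expose .info(), .warning(), .error() for plain messages.
logger.info("logging backend ready")
```

One caveat with this pattern: loguru formats arguments with `{}` placeholders while stdlib logging uses `%s`, so pre-formatted messages (for example f-strings) are the safest choice when either backend may be active.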

Features

| | Component | Details |
|---|---|---|
| ⚙️ | Architecture | Three-layer design: `TextSpitter` convenience function → `WordLoader` dispatcher → `FileExtractor` low-level reader. OOP design enables straightforward subclassing and extension. |
| 🔩 | Code Quality | Strict PEP 8 / ruff linting with black formatting. Full type hints; ships a `py.typed` PEP 561 marker. |
| 📄 | Documentation | API docs auto-published to GitHub Pages via pdoc. Quick-start guide, tutorial, use-case examples, and recipes. |
| 🔌 | Integrations | CI/CD with GitHub Actions (tests + docs + PyPI publish). Package management via `uv`; installable via `pip` or `uv tool install`. |
| 🧩 | Modularity | Core `FileExtractor` separated from dispatch logic in `WordLoader`. Logging abstraction in `logger.py` isolates the optional `loguru` dependency. |
| 🧪 | Testing | ~70 pytest tests covering all readers and input types. Dual-mode log capture fixture works with or without `loguru`. |
| ⚡️ | Performance | Class-level `frozenset` / `dict` constants avoid per-call allocation. Stream rewind avoids re-reading large files. |
| 📦 | Dependencies | Core: `pymupdf`, `pypdf`, `python-docx`. Optional logging: `loguru` (`pip install textspitter[logging]`). |
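
To make the "class-level `frozenset` / `dict` constants" point concrete, here is a rough sketch of extension-based dispatch. The class `ExtensionDispatcher`, its attribute names, and the reader labels are all hypothetical and are not taken from TextSpitter's `WordLoader`; the sketch only shows why lookup tables defined on the class are built once rather than on every call.

```python
from pathlib import Path


class ExtensionDispatcher:
    """Hypothetical dispatcher; not TextSpitter's actual WordLoader."""

    # Built once when the class is defined and shared by every instance,
    # so no set or dict is allocated per call.
    CODE_EXTENSIONS = frozenset({".py", ".rs", ".go", ".js", ".ts"})
    READERS = {".pdf": "pdf", ".docx": "docx", ".txt": "text", ".csv": "text"}

    def reader_for(self, filename: str) -> str:
        """Return the label of the reader that should handle `filename`."""
        suffix = Path(filename).suffix.lower()
        if suffix in self.CODE_EXTENSIONS:
            # Source-code files are treated as plain text in this sketch.
            return "text"
        try:
            return self.READERS[suffix]
        except KeyError:
            raise ValueError(f"unsupported file type: {suffix!r}") from None


dispatcher = ExtensionDispatcher()
print(dispatcher.reader_for("Report.PDF"))
print(dispatcher.reader_for("main.rs"))
```

Supporting a new format in a design like this would typically mean subclassing and extending the class-level tables, which is one way the "straightforward subclassing" claim could play out.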

Project Structure

TextSpitter/
├── .github/
│   └── workflows/
│       ├── docs.yml             # pdoc → GitHub Pages
│       ├── python-publish.yml   # PyPI release
│       └── tests.yml            # pytest matrix (3.12 – 3.14)
├── TextSpitter/
│   ├── __init__.py              # TextSpitter() + WordLoader public API
│   ├── cli.py                   # argparse CLI entry point
│   ├── core.py                  # FileExtractor class
│   ├── logger.py                # Optional loguru / stdlib fallback
│   ├── main.py                  # WordLoader dispatcher
│   ├── py.typed                 # PEP 561 marker
│   └── guide/                   # pdoc documentation pages (subpackage)
├── tests/
│   ├── conftest.py              # shared fixtures (log_capture)
│   ├── test_cli.py
│   ├── test_file_extractor.py
│   ├── test_txt.py
│   └── ...
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
└── uv.lock

Getting Started

Prerequisites

Python ≥ 3.12
uv (recommended) or pip

Installation

From PyPI:

pip install textspitter

# With optional loguru logging
pip install "textspitter[logging]"

Using uv:

uv add textspitter

# With optional loguru logging
uv add "textspitter[logging]"

As a standalone CLI tool:

uv tool install textspitter

From source:

git clone https://github.com/fsecada01/TextSpitter.git
cd TextSpitter
uv sync --all-extras --dev

Usage

As a library (one-liner):

from io import BytesIO
from pathlib import Path

from TextSpitter import TextSpitter

# From a file path
text = TextSpitter(filename="report.pdf")
print(text)

# From a BytesIO stream
pdf_bytes = Path("report.pdf").read_bytes()
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

# From raw bytes
docx_bytes = Path("contract.docx").read_bytes()
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")
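
For readers curious how a stream-first API can accept paths, bytes, and file-like objects alike, the following standard-library sketch shows one plausible normalisation step. The helper `to_stream` and the `Source` alias are hypothetical and are not part of TextSpitter; the real library may handle these inputs differently.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile
from typing import BinaryIO, Union

# Hypothetical alias covering the input kinds listed in the overview.
Source = Union[str, Path, bytes, BinaryIO]


def to_stream(source: Source) -> BinaryIO:
    """Return a readable binary stream positioned at offset 0."""
    if isinstance(source, (bytes, bytearray)):
        # Raw bytes: wrap them in an in-memory stream.
        return BytesIO(source)
    if isinstance(source, (str, Path)):
        # File path: load the file contents into memory.
        return BytesIO(Path(source).read_bytes())
    # Already a stream (BytesIO, SpooledTemporaryFile, ...): rewind it so a
    # previously consumed stream can be read again from the start.
    source.seek(0)
    return source


with SpooledTemporaryFile() as tmp:
    tmp.write(b"hello")            # position is now at the end
    print(to_stream(tmp).read())   # rewound before reading
```

Rewinding rather than copying keeps large uploads from being duplicated in memory, at the cost of mutating the caller's stream position.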

Using the WordLoader class directly:

from TextSpitter.main import WordLoader

loader = WordLoader(filename="data.csv")
text = loader.file_load()

As a CLI tool:

# Extract a single file to stdout
textspitter report.pdf

# Extract multiple files and write to a combined output file
textspitter file1.pdf file2.docx notes.txt -o combined.txt

Testing

uv run pytest tests/

# With coverage
uv run pytest tests/ --cov=TextSpitter --cov-report=term-missing

Roadmap

v1.x (current)

Stream-based API (BytesIO, SpooledTemporaryFile, raw bytes)
CLI entry point (uv tool install textspitter)
Optional loguru logging with stdlib fallback
Programming-language file support (50+ extensions)
CI matrix (Python 3.12 – 3.14) + GitHub Pages docs
Async extraction API
CSV → structured output (list of dicts)
PPTX support
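
As an illustration of what the "CSV → structured output (list of dicts)" item could look like, here is a small sketch using only the standard library. The function name `csv_to_records` is hypothetical, and the eventual TextSpitter API for this feature may differ.

```python
import csv
from io import StringIO


def csv_to_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header line."""
    return list(csv.DictReader(StringIO(text)))


rows = csv_to_records("name,qty\nwidget,3\ngadget,5\n")
for row in rows:
    print(row["name"], row["qty"])
```

Note that `csv.DictReader` leaves every value as a string, so any numeric conversion would be a separate step.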

v2.0 — Rust backend (full roadmap)

Rust splitting core via PyO3 + Maturin — 10x–40x batch throughput
Graceful Python fallback when Rust extension is unavailable
manylinux wheels on PyPI — zero-compile install for Linux users
Memory-mapped file processing for very large PDFs (memmap2)
SIMD-accelerated string search for separator detection
Streaming iterator API (yield chunks instead of collecting all)
Optional SIMD feature flag (pip install "textspitter[simd]")
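
The "streaming iterator API" item describes yielding chunks instead of collecting everything first. A pure-Python sketch of that idea is shown below; `iter_chunks` and its default chunk size are hypothetical and say nothing about how the planned Rust backend will be implemented.

```python
from collections.abc import Iterator
from io import BytesIO
from typing import BinaryIO


def iter_chunks(stream: BinaryIO, size: int = 64 * 1024) -> Iterator[bytes]:
    """Lazily yield successive blocks of at most `size` bytes from `stream`."""
    while chunk := stream.read(size):
        yield chunk


# Only one chunk is held in memory at a time.
for chunk in iter_chunks(BytesIO(b"abcdefg"), size=3):
    print(chunk)
```

Because the generator reads on demand, a consumer can stop early without the rest of the input ever being loaded.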

Contributing

💬 Join the Discussions: Share insights, give feedback, or ask questions.
🐛 Report Issues: Submit bugs or log feature requests.
💡 Submit Pull Requests: Review open PRs or submit your own.

Contributing Guidelines

Fork the Repository: Fork the project to your GitHub account.
Clone Locally: Clone the forked repository.
```
git clone https://github.com/fsecada01/TextSpitter.git
```
Create a New Branch: Always work on a new branch.
```
git checkout -b new-feature-x
```
Make Your Changes: Develop and test your changes locally.
Commit Your Changes: Commit with a clear message.
```
git commit -m 'Add new feature x.'
```
Push to GitHub: Push the changes to your fork.
```
git push origin new-feature-x
```
Submit a Pull Request: Create a PR against main. Describe the changes and motivation clearly.
Review: Once approved, your PR will be merged. Thanks for contributing!

License

TextSpitter is released under the MIT License.
