TextSpitter is a lightweight Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types — file paths, BytesIO streams, SpooledTemporaryFile objects, and raw bytes — into plain strings, making it ideal for pipelines that feed text into LLMs, search engines, or data-processing workflows.
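To illustrate what normalising these input types involves, here is a minimal sketch using only the standard library. The helper name `to_bytes` is hypothetical and is not part of TextSpitter's API; the library's own handling may differ.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile


def to_bytes(source):
    """Hypothetical helper: return the raw bytes behind any supported input type."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        # Treat strings and Path objects as file paths
        return Path(source).read_bytes()
    if hasattr(source, "read"):
        # BytesIO, SpooledTemporaryFile, and other binary streams
        source.seek(0)
        return source.read()
    raise TypeError(f"Unsupported input type: {type(source).__name__}")


payload = b"hello from TextSpitter"
stream = BytesIO(payload)
with SpooledTemporaryFile() as spooled:
    spooled.write(payload)
    from_spooled = to_bytes(spooled)

# All three inputs resolve to the same bytes
print(to_bytes(payload) == to_bytes(stream) == from_spooled)
```

Rewinding the stream before reading means a caller who has already written to or partially read the object still gets the full content.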
Why TextSpitter?
- 📄 Multi-format extraction: PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
- 🔌 Stream-first API: accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes; no temp files required.
- 🛠️ Optional structured logging: install textspitter[logging] to add loguru; falls back to stdlib logging transparently.
- 🖥️ CLI included: uv tool install textspitter gives you a textspitter command for quick one-off extractions.
- 🚀 Automated CI/CD: GitHub Actions run the test matrix (Python 3.12–3.14) and publish docs to GitHub Pages on every push.
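The optional-logging behaviour described above follows a common import-fallback pattern. The sketch below shows the general idea under the assumption that loguru may or may not be installed; it is not a copy of TextSpitter's logger module, and the logger name is illustrative.

```python
import logging

try:
    # Present only when the optional extra is installed
    from loguru import logger
except ImportError:
    # Fall back to the standard library with a basic configuration
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("TextSpitter")

# Both loguru's logger and a stdlib Logger expose .info(), .warning(), .error()
logger.info("extraction started")
```

Because both objects share the common level methods, calling code does not need to know which backend is active.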
| | Component | Details |
|---|---|---|
| ⚙️ | Architecture | |
| 🔩 | Code Quality | |
| 📄 | Documentation | |
| 🔌 | Integrations | |
| 🧩 | Modularity | |
| 🧪 | Testing | |
| ⚡️ | Performance | |
| 📦 | Dependencies | |
TextSpitter/
├── .github/
│   └── workflows/
│       ├── docs.yml            # pdoc → GitHub Pages
│       ├── python-publish.yml  # PyPI release
│       └── tests.yml           # pytest matrix (3.12–3.14)
├── TextSpitter/
│   ├── __init__.py             # TextSpitter() + WordLoader public API
│   ├── cli.py                  # argparse CLI entry point
│   ├── core.py                 # FileExtractor class
│   ├── logger.py               # Optional loguru / stdlib fallback
│   ├── main.py                 # WordLoader dispatcher
│   ├── py.typed                # PEP 561 marker
│   └── guide/                  # pdoc documentation pages (subpackage)
├── tests/
│   ├── conftest.py             # shared fixtures (log_capture)
│   ├── test_cli.py
│   ├── test_file_extractor.py
│   ├── test_txt.py
│   └── ...
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
└── uv.lock

- Python ≥ 3.12
- uv (recommended) or pip
From PyPI:
pip install textspitter
# With optional loguru logging
pip install "textspitter[logging]"

Using uv:
uv add textspitter
# With optional loguru logging
uv add "textspitter[logging]"

As a standalone CLI tool:
uv tool install textspitter

From source:
git clone https://github.com/fsecada01/TextSpitter.git
cd TextSpitter
uv sync --all-extras --dev

As a library (one-liner):
from TextSpitter import TextSpitter

# From a file path
text = TextSpitter(filename="report.pdf")
print(text)

# From a BytesIO stream (pdf_bytes holds the PDF contents as bytes)
from io import BytesIO
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

# From raw bytes (docx_bytes holds the DOCX contents as bytes)
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")

Using the WordLoader class directly:
from TextSpitter.main import WordLoader

loader = WordLoader(filename="data.csv")
text = loader.file_load()

As a CLI tool:
# Extract a single file to stdout
textspitter report.pdf
# Extract multiple files and write to a combined output file
textspitter file1.pdf file2.docx notes.txt -o combined.txt

Run the tests:
uv run pytest tests/
# With coverage
uv run pytest tests/ --cov=TextSpitter --cov-report=term-missing

Roadmap
- Stream-based API (BytesIO, SpooledTemporaryFile, raw bytes)
- CLI entry point (uv tool install textspitter)
- Optional loguru logging with stdlib fallback
- Programming-language file support (50+ extensions)
- CI matrix (Python 3.12 – 3.14) + GitHub Pages docs
- Async extraction API
- CSV → structured output (list of dicts)
- PPTX support
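As a rough idea of what the async and structured-CSV items could look like, the sketch below parses CSV bytes into a list of dicts and wraps the blocking call with `asyncio.to_thread`. Both function names are hypothetical; nothing here reflects an existing or planned TextSpitter signature.

```python
import asyncio
import csv
from io import StringIO


def csv_to_records(data: bytes, encoding: str = "utf-8") -> list[dict[str, str]]:
    """Hypothetical helper: parse CSV bytes into one dict per row."""
    return list(csv.DictReader(StringIO(data.decode(encoding))))


async def csv_to_records_async(data: bytes) -> list[dict[str, str]]:
    """Run the blocking parser in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(csv_to_records, data)


rows = asyncio.run(csv_to_records_async(b"name,lang\nada,python\nlin,rust\n"))
print(rows)
```

Delegating to a thread keeps the synchronous extractor as the single source of truth, so an async API could be layered on without duplicating parsing logic.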
v2.0 — Rust backend (full roadmap)
- Rust splitting core via PyO3 + Maturin — 10x–40x batch throughput
- Graceful Python fallback when Rust extension is unavailable
- manylinux wheels on PyPI: zero-compile install for Linux users
- Memory-mapped file processing for very large PDFs (memmap2)
- SIMD-accelerated string search for separator detection
- Streaming iterator API (yield chunks instead of collecting all)
- Optional SIMD feature flag (pip install "textspitter[simd]")
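The graceful-fallback and streaming-iterator items can be pictured together with a short sketch. The module name `textspitter_rs` and the function `iter_chunks` are placeholders invented for illustration; no such extension exists today, and fixed-size slicing stands in for whatever splitting logic the real core would use.

```python
from collections.abc import Iterator

try:
    # Hypothetical compiled extension; the import is expected to fail today
    from textspitter_rs import iter_chunks
except ImportError:

    def iter_chunks(text: str, chunk_size: int = 1000) -> Iterator[str]:
        """Pure-Python fallback: yield fixed-size chunks lazily."""
        for start in range(0, len(text), chunk_size):
            yield text[start : start + chunk_size]


# Chunks are produced one at a time instead of being collected into a list
for chunk in iter_chunks("abcdefghij", chunk_size=4):
    print(chunk)
```

Because the fallback is a generator, callers can process very large inputs without holding every chunk in memory at once.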
- 💬 Join the Discussions: Share insights, give feedback, or ask questions.
- 🐛 Report Issues: Submit bugs or log feature requests.
- 💡 Submit Pull Requests: Review open PRs or submit your own.
Contributing Guidelines
- Fork the Repository: Fork the project to your GitHub account.
- Clone Locally: Clone the forked repository.
git clone https://github.com/fsecada01/TextSpitter.git
- Create a New Branch: Always work on a new branch.
git checkout -b new-feature-x
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message.
git commit -m 'Add new feature x.'
- Push to GitHub: Push the changes to your fork.
git push origin new-feature-x
- Submit a Pull Request: Create a PR against main. Describe the changes and motivation clearly.
- Review: Once approved, your PR will be merged. Thanks for contributing!
TextSpitter is released under the MIT License.