TextSpitter is a lightweight Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types (file paths, BytesIO streams, SpooledTemporaryFile objects, and raw bytes) into plain strings, which makes it a good fit for pipelines that feed text into LLMs, search engines, or data-processing workflows.
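To make the idea of input normalisation concrete, here is a minimal standard-library sketch. It is not TextSpitter's implementation: the helper name `read_text` is hypothetical, and it only handles UTF-8 plain text, whereas the library also parses formats such as PDF and DOCX.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile


def read_text(source, encoding: str = "utf-8") -> str:
    """Hypothetical helper: turn a path, binary stream, or bytes into a string."""
    if isinstance(source, (bytes, bytearray)):
        raw = bytes(source)
    elif isinstance(source, (str, Path)):
        raw = Path(source).read_bytes()
    else:
        # Anything else is treated as a binary file-like object
        # (BytesIO, SpooledTemporaryFile, an open file handle, ...).
        source.seek(0)  # rewind in case the caller already read from it
        raw = source.read()
    return raw.decode(encoding)


payload = b"hello from a stream"

spooled = SpooledTemporaryFile(max_size=1024)
spooled.write(payload)

results = [
    read_text(payload),           # raw bytes
    read_text(BytesIO(payload)),  # in-memory stream
    read_text(spooled),           # spooled temporary file
]
print(results)
```

All three calls return the same string, which is the property a stream-first API needs: callers should not have to care where the bytes came from.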
Why TextSpitter?
- Multi-format extraction: PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
- Stream-first API: accepts file paths, `BytesIO`, `SpooledTemporaryFile`, or raw `bytes`; no temp files required.
- Optional structured logging: install `textspitter[logging]` to add `loguru`; falls back to stdlib `logging` transparently.
- CLI included: `uv tool install textspitter` gives you a `textspitter` command for quick one-off extractions.
- Automated CI/CD: GitHub Actions run the test matrix (Python 3.12 to 3.14) and publish docs to GitHub Pages on every push.
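The optional-logging behaviour described above is commonly built on a guarded import. The sketch below shows that general pattern only; it does not reproduce the contents of TextSpitter's `logger.py`, and the logger name `textspitter-demo` and the `BACKEND` variable are made up for illustration.

```python
import logging

try:
    # Present only when the optional extra (loguru) is installed.
    from loguru import logger
    BACKEND = "loguru"
except ImportError:
    # Fall back to the standard library with a basic configuration.
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s %(name)s: %(message)s",
    )
    logger = logging.getLogger("textspitter-demo")  # hypothetical name
    BACKEND = "stdlib"

# Both backends expose the same basic methods, so calling code is identical.
logger.info("extraction started")
logger.warning("falling back to a secondary parser")
print(BACKEND)
```

Because loguru's `logger` and a stdlib `Logger` both provide `info`, `warning`, and `error`, the rest of the code can log without knowing which backend was selected.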
| Component | Details |
|---|---|
| Architecture | |
| Code Quality | |
| Documentation | |
| Integrations | |
| Modularity | |
| Testing | |
| Performance | |
| Dependencies | |
TextSpitter/
├── .github/
│   └── workflows/
│       ├── docs.yml            # pdoc → GitHub Pages
│       ├── python-publish.yml  # PyPI release
│       └── tests.yml           # pytest matrix (3.12-3.14)
├── TextSpitter/
│   ├── __init__.py             # TextSpitter() + WordLoader public API
│   ├── cli.py                  # argparse CLI entry point
│   ├── core.py                 # FileExtractor class
│   ├── logger.py               # Optional loguru / stdlib fallback
│   ├── main.py                 # WordLoader dispatcher
│   ├── py.typed                # PEP 561 marker
│   └── guide/                  # pdoc documentation pages (subpackage)
├── tests/
│   ├── conftest.py             # shared fixtures (log_capture)
│   ├── test_cli.py
│   ├── test_file_extractor.py
│   ├── test_txt.py
│   └── ...
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
└── uv.lock

- Python ≥ 3.12
- uv (recommended) or pip
From PyPI:
pip install textspitter
# With optional loguru logging
pip install "textspitter[logging]"

Using uv:
uv add textspitter
# With optional loguru logging
uv add "textspitter[logging]"

As a standalone CLI tool:
uv tool install textspitter

From source:
git clone https://github.com/fsecada01/TextSpitter.git
cd TextSpitter
uv sync --all-extras --dev

As a library (one-liner):
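from io import BytesIO
from pathlib import Path

from TextSpitter import TextSpitter

# From a file path
text = TextSpitter(filename="report.pdf")
print(text)

# From a BytesIO stream
pdf_bytes = Path("report.pdf").read_bytes()  # any source of PDF bytes works
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

# From raw bytes
docx_bytes = Path("contract.docx").read_bytes()  # any source of DOCX bytes works
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")

Using the WordLoader class directly: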
from TextSpitter.main import WordLoader
loader = WordLoader(filename="data.csv")
text = loader.file_load()

As a CLI tool:
# Extract a single file to stdout
textspitter report.pdf
# Extract multiple files and write to a combined output file
textspitter file1.pdf file2.docx notes.txt -o combined.txt

Run the test suite:
uv run pytest tests/
# With coverage
uv run pytest tests/ --cov=TextSpitter --cov-report=term-missing

Roadmap:
- Stream-based API (`BytesIO`, `SpooledTemporaryFile`, raw `bytes`)
- CLI entry point (`uv tool install textspitter`)
- Optional loguru logging with stdlib fallback
- Programming-language file support (50+ extensions)
- CI matrix (Python 3.12 to 3.14) + GitHub Pages docs
- Async extraction API
- CSV → structured output (list of dicts)
- PPTX support
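As an illustration of what "CSV → structured output (list of dicts)" could look like, here is a small sketch built on the standard library's `csv.DictReader`. The function name `csv_to_records` and the sample data are hypothetical; the planned TextSpitter API may differ.

```python
import csv
from io import StringIO


def csv_to_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header row."""
    return list(csv.DictReader(StringIO(text)))


sample = "name,pages\nreport.pdf,12\nnotes.txt,1\n"
records = csv_to_records(sample)
print(records)
```

Note that `DictReader` leaves every value as a string; converting `pages` to an integer would be a separate step.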
v2.0: Rust backend (full roadmap)
- Rust splitting core via PyO3 + Maturin: 10x to 40x batch throughput
- Graceful Python fallback when Rust extension is unavailable
- `manylinux` wheels on PyPI: zero-compile install for Linux users
- Memory-mapped file processing for very large PDFs (`memmap2`)
- SIMD-accelerated string search for separator detection
- Streaming iterator API (yield chunks instead of collecting all)
- Optional SIMD feature flag (`pip install "textspitter[simd]"`)
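The "yield chunks instead of collecting all" item can be pictured with a plain Python generator. This is only a sketch of the general technique: `iter_chunks` and its `chunk_size` parameter are hypothetical names, and it splits on a fixed character count rather than on detected separators.

```python
from collections.abc import Iterator
from io import StringIO
from typing import TextIO


def iter_chunks(stream: TextIO, chunk_size: int = 16) -> Iterator[str]:
    """Yield successive chunks of at most chunk_size characters."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty string means the stream is exhausted
            break
        yield chunk


text = "TextSpitter streams text in small pieces."
chunks = list(iter_chunks(StringIO(text), chunk_size=16))
print(chunks)
```

In real use the caller would loop over `iter_chunks(...)` directly instead of wrapping it in `list`, so that only one chunk is held in memory at a time.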
- Join the Discussions: Share insights, give feedback, or ask questions.
- Report Issues: Submit bugs or log feature requests.
- Submit Pull Requests: Review open PRs or submit your own.
Contributing Guidelines
- Fork the Repository: Fork the project to your GitHub account.
- Clone Locally: Clone the forked repository.
git clone https://github.com/<your-username>/TextSpitter.git
- Create a New Branch: Always work on a new branch.
git checkout -b new-feature-x
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message.
git commit -m 'Add new feature x.'
- Push to GitHub: Push the changes to your fork.
git push origin new-feature-x
- Submit a Pull Request: Create a PR against `main`. Describe the changes and motivation clearly.
- Review: Once approved, your PR will be merged. Thanks for contributing!
TextSpitter is released under the MIT License.