TextSpitter 2.0 Rust Roadmap
Status: Planning · Current Version: 1.0 (Python) · Target: 2.0 (Python wrapper over Rust core)
TextSpitter 1.0 is a pure Python document-to-text extraction and splitting library. Version 2.0 will reimplement the core splitting engine in Rust via PyO3 and Maturin, keeping the existing Python public API fully backwards-compatible while delivering significant performance gains for batch and high-throughput workloads.
- Maintain 100% API compatibility — existing users change nothing
- Achieve 10x–40x throughput improvement on batch document processing
- Eliminate Python GIL bottlenecks for CPU-bound splitting operations
- Enable zero-copy string slicing and parallel chunk processing via Rayon
- Provide a graceful Python fallback when the Rust extension is unavailable
- TextSpitter will not perform OCR in Rust — scanned-PDF handling remains a pre-processing concern for the caller
- TextSpitter will not replace the Python-level format dispatch (WordLoader): only the splitting/chunking hot path moves to Rust
- No breaking changes to the public TextSpitter(file_obj, filename) call signature
TextSpitter (Python public API, unchanged)
├── TextSpitter/__init__.py     # Public entry point (TextSpitter function + __version__)
├── TextSpitter/main.py         # WordLoader: format dispatcher (unchanged)
├── TextSpitter/core.py         # FileExtractor: low-level reader (unchanged)
├── TextSpitter/splitters.py    # NEW: thin Python wrappers over Rust splitters
└── text_spitter_rust/          # NEW: compiled Rust extension (via PyO3)
    ├── CharacterTextSplitter   # Parallel character splitting
    ├── TokenTextSplitter       # Fast token counting & chunking
    ├── RecursiveSplitter       # Recursive splitting with overlap
    └── BatchProcessor          # Rayon-parallel batch operations

src/                            # NEW: Rust source tree
├── lib.rs                      # PyO3 module definition
├── splitters/
│   ├── character.rs
│   ├── token.rs
│   └── recursive.rs
└── utils.rs
Cargo.toml                      # Rust manifest
pyproject.toml                  # Updated to use maturin build backend
Import path note: The existing package is TextSpitter (capital T). The Rust extension module is named text_spitter_rust to avoid collision. The Python source tree keeps its current layout: only the new splitters.py module and the compiled extension are added. Do not rename or move existing modules.
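As a rough illustration of the naming split described in the import path note, the sketch below checks whether the compiled module can be located without importing it. The helper name rust_backend_available is hypothetical and not part of the roadmap; only the module name text_spitter_rust comes from the text above.

```python
import importlib.util


def rust_backend_available(module_name: str = "text_spitter_rust") -> bool:
    """Return True if a top-level module with this name can be located.

    Hypothetical helper for illustration. find_spec only locates the module;
    an actual import could still fail (for example on an ABI mismatch).
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    backend = "rust" if rust_backend_available() else "python fallback"
    print(f"splitting backend: {backend}")
```

A check like this could be useful in diagnostics or CI logs, since the wrapper is meant to fall back silently when the extension is missing.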
Before beginning development, ensure the following are installed:
# Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup update stable
# Maturin (builds PyO3 extensions) — install inside the project venv
uv add --dev maturin
# Verify
rustc --version
maturin --version

Dev setup note: maturin develop installs the compiled extension into the active virtual environment. Always activate the project venv (source .venv/bin/activate or uv run) before running maturin commands; otherwise the extension is installed into the wrong environment.
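One way to guard against the wrong-environment mistake from the dev setup note is a small pre-flight check. This is only a sketch: the function name in_virtualenv is an assumption, and it relies on the standard behaviour that sys.prefix differs from sys.base_prefix inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix still points at the base installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"venv active: {sys.prefix}")
    else:
        print("warning: no virtual environment active; "
              "maturin develop would target the wrong environment",
              file=sys.stderr)
```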
Rationale: Resolve all outstanding Dependabot vulnerabilities before introducing new build infrastructure. Clean baseline, then build up.
| Severity | Package | Issue |
|---|---|---|
| HIGH | urllib3 | Decompression-bomb safeguards bypassed; unbounded redirect chain; streaming API mishandles compressed data |
| HIGH | cryptography | Subgroup attack via missing SECT curve validation |
| HIGH | nbconvert | Uncontrolled search path → arbitrary code execution on Windows |
| HIGH | tornado | Excessive logging via malformed multipart form data (dev dep) |
| MEDIUM | pypdf | Multiple DoS vectors: infinite loops on malformed outlines, DCT images, LZWDecode/FlateDecode RAM exhaustion |
| MEDIUM | urllib3 | Redirect controls missing in browser/Node.js contexts; redirects not disabled on PoolManager retries |
| LOW | pypdf | Long runtimes on malformed startxref / missing /Root object |
| LOW | jupyterlab | LaTeX links missing noopener attribute (dev dep) |
- 0.1 Update urllib3: uv lock --upgrade-package urllib3
- 0.2 Update pypdf: uv lock --upgrade-package pypdf, then run the full test suite (core dep)
- 0.3 Update cryptography: uv lock --upgrade-package cryptography
- 0.4 Update dev deps: uv lock --upgrade-package nbconvert --upgrade-package tornado --upgrade-package jupyterlab
- 0.5 Verify: uv run pytest tests/ (all 82 tests must pass). Confirm Dependabot alerts are resolved.
Replace the current build backend with Maturin:
[build-system]
requires = ["maturin>=1.4,<2.0"]
build-backend = "maturin"
[tool.maturin]
python-source = "." # keep existing TextSpitter/ layout in place
features = ["pyo3/extension-module"]

# Cargo.toml
[package]
name = "text_spitter_rust"
version = "2.0.0"
edition = "2021"
[lib]
name = "text_spitter_rust"
crate-type = ["cdylib"]
[dependencies]
# Check https://pyo3.rs for the latest stable release at implementation time.
pyo3 = { version = ">=0.21", features = ["extension-module"] }
rayon = "1.9"
[profile.release]
opt-level = 3
lto = true
codegen-units = 1

No source restructuring needed. Add a single new file, TextSpitter/splitters.py, that houses the Python wrapper classes (see Phase 3). The existing module layout is untouched.
use pyo3::prelude::*;

mod splitters;

// Bound<'_, PyModule> is the module signature for PyO3 0.21 and later;
// the older &PyModule form does not compile on current releases.
#[pymodule]
fn text_spitter_rust(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<splitters::character::CharacterTextSplitter>()?;
    m.add_class::<splitters::token::TokenTextSplitter>()?;
    Ok(())
}

split_internal lives in a separate impl block so PyO3 does not expose it to Python.
use pyo3::prelude::*;
use rayon::prelude::*;

#[pyclass]
pub struct CharacterTextSplitter {
    chunk_size: usize,
    chunk_overlap: usize,
    separator: String,
}

#[pymethods]
impl CharacterTextSplitter {
    #[new]
    fn new(chunk_size: usize, chunk_overlap: usize, separator: String) -> Self {
        Self { chunk_size, chunk_overlap, separator }
    }

    /// Split a single document into chunks.
    fn split_text(&self, text: &str) -> PyResult<Vec<String>> {
        Ok(self.split_internal(text))
    }

    /// Split many documents in parallel with Rayon.
    fn split_texts(&self, texts: Vec<String>) -> PyResult<Vec<Vec<String>>> {
        let results: Vec<Vec<String>> = texts
            .par_iter()
            .map(|text| self.split_internal(text))
            .collect();
        Ok(results)
    }
}

// Private: NOT exposed to Python.
impl CharacterTextSplitter {
    // Note: len() counts bytes here, while the Python fallback counts code points,
    // so chunk boundaries can differ on non-ASCII text.
    fn split_internal(&self, text: &str) -> Vec<String> {
        let parts: Vec<&str> = text.split(&*self.separator).collect();
        let mut chunks = Vec::new();
        let mut current = String::new();
        for part in parts {
            if current.len() + part.len() > self.chunk_size && !current.is_empty() {
                chunks.push(current.trim().to_string());
                // Step forward to a char boundary so slicing multi-byte UTF-8 cannot panic.
                let mut overlap_start = current.len().saturating_sub(self.chunk_overlap);
                while !current.is_char_boundary(overlap_start) {
                    overlap_start += 1;
                }
                current = current[overlap_start..].to_string();
            }
            if !current.is_empty() {
                current.push_str(&self.separator);
            }
            current.push_str(part);
        }
        if !current.trim().is_empty() {
            chunks.push(current.trim().to_string());
        }
        chunks
    }
}

TextSpitter/splitters.py attempts the Rust import and falls back to pure Python. The Python fallback must be fully implemented before merging.
from __future__ import annotations

try:
    from text_spitter_rust import CharacterTextSplitter as _RustCharacterSplitter
    _RUST_AVAILABLE = True
except ImportError:
    _RUST_AVAILABLE = False


class CharacterTextSplitter:
    def __init__(self, chunk_size=1000, chunk_overlap=200, separator="\n\n", use_rust=True):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separator = separator
        if _RUST_AVAILABLE and use_rust:
            self._backend = _RustCharacterSplitter(chunk_size, chunk_overlap, separator)
            self._use_rust = True
        else:
            self._backend = None
            self._use_rust = False

    def split_text(self, text: str) -> list[str]:
        if self._use_rust:
            return self._backend.split_text(text)
        return self._split_text_python(text)

    def split_texts(self, texts: list[str]) -> list[list[str]]:
        if self._use_rust:
            return self._backend.split_texts(texts)
        return [self.split_text(t) for t in texts]

    def _split_text_python(self, text: str) -> list[str]:
        parts = text.split(self.separator)
        chunks: list[str] = []
        current = ""
        for part in parts:
            if len(current) + len(part) > self.chunk_size and current:
                chunks.append(current.strip())
                current = current[max(0, len(current) - self.chunk_overlap):]
            if current:
                current += self.separator
            current += part
        if current.strip():
            chunks.append(current.strip())
        return chunks

source .venv/bin/activate   # activate venv first
maturin develop # dev build (fast compile)
maturin develop --release # optimized build
maturin build --release # build distributable wheel
pip install target/wheels/text_spitter_rust-*.whl

import pytest

from TextSpitter.splitters import CharacterTextSplitter

# Skip this module when the extension is not built; otherwise use_rust=True
# silently falls back to Python and the parity checks compare Python with itself.
pytest.importorskip("text_spitter_rust")

SAMPLE = "Lorem ipsum dolor sit amet. " * 500


def test_rust_python_parity():
    assert CharacterTextSplitter(use_rust=False).split_text(SAMPLE) == \
        CharacterTextSplitter(use_rust=True).split_text(SAMPLE)


def test_batch_parity():
    docs = [SAMPLE] * 100
    assert CharacterTextSplitter(use_rust=False).split_texts(docs) == \
        CharacterTextSplitter(use_rust=True).split_texts(docs)


@pytest.mark.parametrize("text", ["", "no separator", "\n\n" * 10, "a" * 5000])
def test_edge_cases_parity(text):
    assert CharacterTextSplitter(use_rust=False).split_text(text) == \
        CharacterTextSplitter(use_rust=True).split_text(text)

Expected at 10,000 documents: Python ~45s · Rust ~1.2s · ~37x speedup
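The expected figures above would need a repeatable timing harness to confirm. The sketch below shows one possible shape for it; it is not the project's bench_splitting.py, and the names best_time and compare are assumptions. The two stand-in callables exist only so the example runs without the package installed.

```python
import time
from typing import Callable

SplitFn = Callable[[str], list[str]]


def best_time(split_fn: SplitFn, docs: list[str], repeats: int = 3) -> float:
    """Return the fastest wall-clock time (seconds) over several full passes."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            split_fn(doc)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(backends: dict[str, SplitFn], docs: list[str], repeats: int = 3) -> dict[str, float]:
    """Time every backend on the same documents."""
    return {name: best_time(fn, docs, repeats) for name, fn in backends.items()}


if __name__ == "__main__":
    docs = ["Lorem ipsum dolor sit amet. " * 500] * 200
    # Stand-ins only: in the project these would be the split_text methods of
    # CharacterTextSplitter(use_rust=False) and CharacterTextSplitter(use_rust=True).
    backends = {
        "paragraphs": lambda doc: doc.split("\n\n"),
        "sentences": lambda doc: doc.split(". "),
    }
    for name, seconds in compare(backends, docs).items():
        rate = len(docs) / max(seconds, 1e-9)
        print(f"{name}: {seconds:.4f}s ({rate:.0f} docs/s)")
```

Taking the minimum over several passes is a common way to reduce noise from other processes; averages or percentiles may be more appropriate for the CI artifact.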
.github/workflows/ci-rust.yml jobs:
- lint-rust: cargo fmt --check + cargo clippy -- -D warnings
- test: matrix (ubuntu/macos/windows × Python 3.10/3.11/3.12), maturin develop --release + pytest
- build-wheels: PyO3/maturin-action@v1 with manylinux: auto
- benchmark: runs bench_splitting.py, uploads results as artifact
main ← stable 1.x Python releases
feature/rust-backend ← 2.0 development branch
- Memory-mapped file processing for very large PDFs (memmap2 crate)
- SIMD-accelerated separator detection (opt-in [features] simd = [])
- Streaming iterator API (yield chunks vs. collect all)
- PyPI publish job for manylinux wheels on release
- cargo bench with criterion for reproducible micro-benchmarks
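For the streaming iterator item in the list above, a generator-based variant of the character splitter might look like the sketch below. It is an illustration in pure Python, not a committed design: the name iter_chunks is hypothetical, and the chunking rule is assumed to match the fallback splitter shown earlier.

```python
from typing import Iterator


def iter_chunks(
    text: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    separator: str = "\n\n",
) -> Iterator[str]:
    """Yield chunks one at a time instead of collecting them into a list."""
    current = ""
    for part in text.split(separator):
        # Emit the buffer once adding the next part would exceed the limit,
        # then carry the last chunk_overlap characters into the next chunk.
        if current and len(current) + len(part) > chunk_size:
            yield current.strip()
            current = current[max(0, len(current) - chunk_overlap):]
        if current:
            current += separator
        current += part
    if current.strip():
        yield current.strip()


if __name__ == "__main__":
    chunks = list(iter_chunks("aaaa\n\nbbbb\n\ncccc", chunk_size=10, chunk_overlap=3))
    print(chunks)  # → ['aaaa\n\nbbbb', 'bbb\n\ncccc']
```

The second chunk starts with "bbb", the last three characters of the first chunk, which shows the overlap carried forward. A caller can stop consuming early without paying for the remaining chunks, although this sketch still splits the whole text up front.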