TextSpitter 2.0 Rust Roadmap

fsecada01 edited this page Feb 18, 2026 · 1 revision

TextSpitter 2.0 — Rust Backend Roadmap

Status: Planning · Current Version: 1.0 (Python) · Target: 2.0 (Python wrapper over Rust core)

TextSpitter 1.0 is a pure-Python document-to-text extraction and splitting library. Version 2.0 will reimplement the core splitting engine in Rust via PyO3 and Maturin, keeping the existing Python public API fully backwards compatible while delivering significant performance gains for batch and high-throughput workloads.


Goals

  • Maintain 100% API compatibility — existing users change nothing
  • Achieve 10x–40x throughput improvement on batch document processing
  • Eliminate Python GIL bottlenecks for CPU-bound splitting operations
  • Enable zero-copy string slicing and parallel chunk processing via Rayon
  • Provide a graceful Python fallback when the Rust extension is unavailable

Non-Goals

  • TextSpitter will not perform OCR in Rust — scanned-PDF handling remains a pre-processing concern for the caller
  • TextSpitter will not replace the Python-level format dispatch (WordLoader) — only the splitting/chunking hot-path moves to Rust
  • No breaking changes to the public TextSpitter(file_obj, filename) call signature

Proposed Architecture

TextSpitter (Python public API — unchanged)
    ├── TextSpitter/__init__.py    # Public entry point (TextSpitter function + __version__)
    ├── TextSpitter/main.py        # WordLoader — format dispatcher (unchanged)
    ├── TextSpitter/core.py        # FileExtractor — low-level reader (unchanged)
    ├── TextSpitter/splitters.py   # NEW: thin Python wrappers over Rust splitters
    └── text_spitter_rust/         # NEW: compiled Rust extension (via PyO3)
         ├── CharacterTextSplitter # Parallel character splitting
         ├── TokenTextSplitter     # Fast token counting & chunking
         ├── RecursiveSplitter     # Recursive splitting with overlap
         └── BatchProcessor        # Rayon-parallel batch operations

src/                               # NEW: Rust source tree
    ├── lib.rs                     # PyO3 module definition
    ├── splitters/
    │   ├── character.rs
    │   ├── token.rs
    │   └── recursive.rs
    └── utils.rs

Cargo.toml                         # Rust manifest
pyproject.toml                     # Updated to use maturin build backend

Import path note: The existing package is TextSpitter (capital T). The Rust extension module is named text_spitter_rust to avoid collision. The Python source tree keeps its current layout — only the new splitters.py module and the compiled extension are added. Do not rename or move existing modules.


Prerequisites

Before beginning development, ensure the following are installed:

# Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup update stable

# Maturin (builds PyO3 extensions) — install inside the project venv
uv add --dev maturin

# Verify
rustc --version
maturin --version

Dev setup note: maturin develop installs the compiled extension into the active virtual environment. Always activate the project venv (source .venv/bin/activate or uv run) before running maturin commands, otherwise the extension is installed into the wrong environment.
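A quick sanity check before running maturin commands — print the active interpreter and its site-packages path, and confirm both live under the project's .venv (a hypothetical check, not part of the build itself):

```python
import sys
import sysconfig

# maturin develop installs the compiled extension into whichever
# interpreter is currently active; print where that is before building.
site_packages = sysconfig.get_paths()["purelib"]
print(sys.executable)   # should point inside the project's .venv
print(site_packages)
```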


Phase 0: Security Cleanup (Pre-requisite)

Rationale: Resolve all outstanding Dependabot vulnerabilities before introducing new build infrastructure. Clean baseline, then build up.

Affected packages

Severity · Package · Issue
HIGH · urllib3 · Decompression-bomb safeguards bypassed; unbounded redirect chain; streaming API mishandles compressed data
HIGH · cryptography · Subgroup attack via missing SECT curve validation
HIGH · nbconvert · Uncontrolled search path → arbitrary code execution on Windows
HIGH · tornado · Excessive logging via malformed multipart form data (dev dep)
MEDIUM · pypdf · Multiple DoS vectors: infinite loops on malformed outlines, DCT images, LZWDecode/FlateDecode RAM exhaustion
MEDIUM · urllib3 · Redirect controls missing in browser/Node.js contexts; redirects not disabled on PoolManager retries
LOW · pypdf · Long runtimes on malformed startxref / missing /Root object
LOW · jupyterlab · LaTeX links missing noopener attribute (dev dep)

Steps

  • 0.1 — Update urllib3: uv lock --upgrade-package urllib3
  • 0.2 — Update pypdf: uv lock --upgrade-package pypdf + full test suite (core dep)
  • 0.3 — Update cryptography: uv lock --upgrade-package cryptography
  • 0.4 — Update dev deps: uv lock --upgrade-package nbconvert --upgrade-package tornado --upgrade-package jupyterlab
  • 0.5 — Verify: uv run pytest tests/ — all 82 tests must pass. Confirm Dependabot alerts resolved.

Phase 1: Project Setup & Build Infrastructure

1.1 — Update pyproject.toml

Replace the current build backend with Maturin:

[build-system]
requires = ["maturin>=1.4,<2.0"]
build-backend = "maturin"

[tool.maturin]
python-source = "."                  # keep existing TextSpitter/ layout in place
module-name = "text_spitter_rust"    # explicit; matches the Cargo [lib] name
features = ["pyo3/extension-module"]

1.2 — Create Cargo.toml

[package]
name = "text_spitter_rust"
version = "2.0.0"
edition = "2021"

[lib]
name = "text_spitter_rust"
crate-type = ["cdylib"]

[dependencies]
# Check https://pyo3.rs for the latest stable release at implementation time.
pyo3 = { version = ">=0.21", features = ["extension-module"] }
rayon = "1.9"

[profile.release]
opt-level = 3
lto = true
codegen-units = 1

1.3 — Add TextSpitter/splitters.py

No source restructuring needed. Add a single new file TextSpitter/splitters.py that houses the Python wrapper classes (see Phase 3). The existing module layout is untouched.


Phase 2: Core Rust Implementation

2.1 — PyO3 Module Entry Point (src/lib.rs)

use pyo3::prelude::*;

mod splitters;

#[pymodule]
fn text_spitter_rust(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<splitters::character::CharacterTextSplitter>()?;
    m.add_class::<splitters::token::TokenTextSplitter>()?;
    Ok(())
}

2.2 — Character Splitter (src/splitters/character.rs)

split_internal lives in a separate impl block so PyO3 does not expose it to Python.

use pyo3::prelude::*;
use rayon::prelude::*;

#[pyclass]
pub struct CharacterTextSplitter {
    chunk_size: usize,
    chunk_overlap: usize,
    separator: String,
}

#[pymethods]
impl CharacterTextSplitter {
    #[new]
    fn new(chunk_size: usize, chunk_overlap: usize, separator: String) -> Self {
        Self { chunk_size, chunk_overlap, separator }
    }

    fn split_text(&self, text: &str) -> PyResult<Vec<String>> {
        Ok(self.split_internal(text))
    }

    fn split_texts(&self, texts: Vec<String>) -> PyResult<Vec<Vec<String>>> {
        let results: Vec<Vec<String>> = texts
            .par_iter()
            .map(|text| self.split_internal(text))
            .collect();
        Ok(results)
    }
}

// Private — NOT exposed to Python.
impl CharacterTextSplitter {
    fn split_internal(&self, text: &str) -> Vec<String> {
        let parts: Vec<&str> = text.split(&*self.separator).collect();
        let mut chunks = Vec::new();
        let mut current = String::new();

        for part in parts {
            if current.len() + part.len() > self.chunk_size && !current.is_empty() {
                chunks.push(current.trim().to_string());
                // Byte-based overlap: step forward to a valid UTF-8 char
                // boundary so the slice below cannot panic on multibyte text.
                let mut overlap_start = current.len().saturating_sub(self.chunk_overlap);
                while !current.is_char_boundary(overlap_start) {
                    overlap_start += 1;
                }
                current = current[overlap_start..].to_string();
            }
            if !current.is_empty() {
                current.push_str(&self.separator);
            }
            current.push_str(part);
        }

        if !current.trim().is_empty() {
            chunks.push(current.trim().to_string());
        }
        chunks
    }
}
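One parity caveat worth a dedicated regression test: the Rust splitter measures chunk_size and chunk_overlap in bytes (str::len), while the Python fallback measures characters, so multibyte UTF-8 input can chunk differently between backends even when both run correctly. A quick illustration:

```python
# 'é' is one Python character but two UTF-8 bytes, so byte-based and
# character-based length checks diverge on non-ASCII text.
s = "héllo"
print(len(s))                  # 5 characters (Python fallback's measure)
print(len(s.encode("utf-8")))  # 6 bytes (Rust str::len's measure)
```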

Phase 3: Python Wrapper with Graceful Fallback

TextSpitter/splitters.py — attempts Rust import, falls back to pure Python. The Python fallback must be fully implemented before merging.

from __future__ import annotations

try:
    from text_spitter_rust import CharacterTextSplitter as _RustCharacterSplitter
    _RUST_AVAILABLE = True
except ImportError:
    _RUST_AVAILABLE = False


class CharacterTextSplitter:
    def __init__(self, chunk_size=1000, chunk_overlap=200, separator="\n\n", use_rust=True):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separator = separator
        if _RUST_AVAILABLE and use_rust:
            self._backend = _RustCharacterSplitter(chunk_size, chunk_overlap, separator)
            self._use_rust = True
        else:
            self._backend = None
            self._use_rust = False

    def split_text(self, text: str) -> list[str]:
        if self._use_rust:
            return self._backend.split_text(text)
        return self._split_text_python(text)

    def split_texts(self, texts: list[str]) -> list[list[str]]:
        if self._use_rust:
            return self._backend.split_texts(texts)
        return [self.split_text(t) for t in texts]

    def _split_text_python(self, text: str) -> list[str]:
        parts = text.split(self.separator)
        chunks: list[str] = []
        current = ""
        for part in parts:
            if len(current) + len(part) > self.chunk_size and current:
                chunks.append(current.strip())
                current = current[max(0, len(current) - self.chunk_overlap):]
            if current:
                current += self.separator
            current += part
        if current.strip():
            chunks.append(current.strip())
        return chunks
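
To make the chunk/overlap semantics concrete, here is the same fallback algorithm as a standalone function applied to a small input (illustrative only; the values are chosen to force exactly one flush):

```python
def split(text, chunk_size, chunk_overlap, separator):
    # Standalone copy of the pure-Python fallback above, for illustration.
    chunks, current = [], ""
    for part in text.split(separator):
        if len(current) + len(part) > chunk_size and current:
            chunks.append(current.strip())
            # carry the last chunk_overlap characters into the next chunk
            current = current[max(0, len(current) - chunk_overlap):]
        if current:
            current += separator
        current += part
    if current.strip():
        chunks.append(current.strip())
    return chunks

print(split("aaaa\n\nbbbb\n\ncccc", chunk_size=8, chunk_overlap=2, separator="\n\n"))
# → ['aaaa\n\nbbbb', 'bb\n\ncccc'] — the 'bb' prefix is the 2-char overlap
```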

Phase 4: Build & Local Development

source .venv/bin/activate   # activate venv first

maturin develop              # dev build (fast compile)
maturin develop --release    # optimized build
maturin build --release      # build distributable wheel
pip install target/wheels/text_spitter_rust-*.whl

Phase 5: Testing Strategy

Parity tests (tests/test_parity.py)

import pytest
from TextSpitter.splitters import _RUST_AVAILABLE, CharacterTextSplitter

# Without the compiled extension, use_rust=True silently falls back to
# Python and every parity assertion compares Python against itself.
pytestmark = pytest.mark.skipif(
    not _RUST_AVAILABLE, reason="Rust extension not built; parity would be vacuous"
)

SAMPLE = "Lorem ipsum dolor sit amet. " * 500

def test_rust_python_parity():
    assert CharacterTextSplitter(use_rust=False).split_text(SAMPLE) == \
           CharacterTextSplitter(use_rust=True).split_text(SAMPLE)

def test_batch_parity():
    docs = [SAMPLE] * 100
    assert CharacterTextSplitter(use_rust=False).split_texts(docs) == \
           CharacterTextSplitter(use_rust=True).split_texts(docs)

@pytest.mark.parametrize("text", ["", "no separator", "\n\n" * 10, "a" * 5000])
def test_edge_cases_parity(text):
    assert CharacterTextSplitter(use_rust=False).split_text(text) == \
           CharacterTextSplitter(use_rust=True).split_text(text)

Benchmark (tests/bench_splitting.py)

Expected at 10,000 documents: Python ~45s · Rust ~1.2s · ~37x speedup


Phase 6: CI/CD — GitHub Actions

.github/workflows/ci-rust.yml jobs:

  1. lint-rust — cargo fmt --check + cargo clippy -- -D warnings
  2. test — matrix (ubuntu/macos/windows × Python 3.10/3.11/3.12), maturin develop --release + pytest
  3. build-wheels — PyO3/maturin-action@v1 with manylinux: auto
  4. benchmark — runs bench_splitting.py, uploads results as artifact
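
A skeleton of what ci-rust.yml might look like (job names follow the list above; action versions and matrix values are assumptions to verify against current GitHub Actions documentation):

```yaml
name: ci-rust
on: [push, pull_request]

jobs:
  lint-rust:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo fmt --check
      - run: cargo clippy -- -D warnings

  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python: ["3.10", "3.11", "3.12"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
      - run: pip install maturin pytest
      - run: maturin develop --release
      - run: pytest tests/

  build-wheels:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: PyO3/maturin-action@v1
        with:
          manylinux: auto
          args: --release
```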

Branch Strategy

main                  ← stable 1.x Python releases
feature/rust-backend  ← 2.0 development branch

Nice-to-Haves

  • Memory-mapped file processing for very large PDFs (memmap2 crate)
  • SIMD-accelerated separator detection (opt-in [features] simd = [])
  • Streaming iterator API (yield chunks vs. collect all)
  • PyPI publish job for manylinux wheels on release
  • cargo bench with criterion for reproducible micro-benchmarks