Commit 35df77f

docs(stdlib): fix example for delta semantics and note validator latency
Two documentation fixes following the per-chunk semantics correction:

- streaming_chunking.py: MaxSentencesReq previously counted sentence-end
  punctuation in the chunk, which worked under the old accumulated-text
  behaviour but returns at most 1 per sentence under delta semantics.
  Rewritten to increment self._count once per chunk -- the canonical
  pattern for a requirement that needs context beyond a single chunk.
- stream_with_chunking docstring: add a Note that chunks are emitted to
  the consumer only after every active validator returns for that chunk.
  A slow stream_validate (e.g. an LLM-based one) therefore adds latency
  to every chunk. The invariant preserved is that the consumer never
  sees unvalidated content; a concurrent-emission fast path may be added
  in future if a concrete use case calls for it.

Assisted-by: Claude Code
1 parent ea6bdb0 commit 35df77f

2 files changed: 22 additions & 3 deletions

docs/examples/streaming/streaming_chunking.py (9 additions, 3 deletions)
```diff
@@ -23,7 +23,13 @@
 
 
 class MaxSentencesReq(Requirement):
-    """Fails if the model generates more than *limit* sentences mid-stream."""
+    """Fails if the model generates more than *limit* sentences mid-stream.
+
+    Each ``stream_validate`` call receives one complete sentence from the
+    :class:`~mellea.stdlib.chunking.SentenceChunker`. The running count is
+    maintained on ``self`` — this is the standard pattern for requirements
+    that need context beyond a single chunk.
+    """
 
     def __init__(self, limit: int) -> None:
         self._limit = limit
@@ -35,8 +41,8 @@ def format_for_llm(self) -> str:
     async def stream_validate(
         self, chunk: str, *, backend: Backend, ctx: Context
     ) -> PartialValidationResult:
-        sentence_count = chunk.count(".") + chunk.count("!") + chunk.count("?")
-        if sentence_count > self._limit:
+        self._count += 1
+        if self._count > self._limit:
             return PartialValidationResult(
                 "fail",
                 reason=f"Response exceeded {self._limit} sentence limit mid-stream",
```
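The commit message's point can be seen in a minimal standalone sketch of the two counting strategies under delta semantics, where each call receives one new sentence rather than the accumulated text. The `PunctuationCounter` and `StatefulCounter` class names are illustrative, not the mellea API:

```python
# Hypothetical standalone demo (not the mellea API): why punctuation
# counting fails under delta semantics, while per-chunk counting works.

class PunctuationCounter:
    """Old pattern: count sentence-end punctuation in the chunk."""

    def __init__(self, limit: int) -> None:
        self._limit = limit

    def check(self, chunk: str) -> bool:
        # Under delta semantics each chunk is a single sentence, so this
        # count is at most 1 per call and the limit is never exceeded.
        count = chunk.count(".") + chunk.count("!") + chunk.count("?")
        return count <= self._limit


class StatefulCounter:
    """New pattern: increment once per chunk, keeping state on self."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._count = 0

    def check(self, chunk: str) -> bool:
        self._count += 1
        return self._count <= self._limit


sentences = ["One.", "Two.", "Three.", "Four."]
punct = PunctuationCounter(limit=3)
stateful = StatefulCounter(limit=3)
punct_results = [punct.check(s) for s in sentences]
stateful_results = [stateful.check(s) for s in sentences]
print(punct_results)     # [True, True, True, True] -- never fails
print(stateful_results)  # [True, True, True, False] -- fails on the 4th
```

The stateful version is what the diff above adopts: the count lives on the requirement instance because no single chunk carries enough context to decide.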

mellea/stdlib/streaming.py (13 additions, 0 deletions)
```diff
@@ -247,6 +247,19 @@ async def stream_with_chunking(
     ``self._seen = self._seen + chunk``). They must not read ``mot.astream()``
     directly — this orchestrator is the single consumer of the MOT stream.
 
+    Note:
+        Chunks are emitted to the consumer (via
+        :meth:`StreamChunkingResult.astream`) only after every requirement's
+        ``stream_validate`` has returned for that chunk. A slow validator
+        (for example, one that invokes an LLM) therefore adds latency to
+        every chunk — the consumer sees a chunk at most as quickly as the
+        slowest active validator. This trade is deliberate in v1: it
+        preserves the invariant that the consumer never sees content that
+        has not been validated, which matters for UIs displaying generated
+        text live. A future fast-path mode that emits chunks to the
+        consumer concurrently with validation (at the cost of that
+        invariant) may be added if a concrete use case calls for it.
+
     Note:
         v1 retry is simple re-invocation of this function. Plugin hooks
         (``SAMPLING_LOOP_START``, ``SAMPLING_REPAIR``, etc.) do not fire
```
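The latency property documented in the new Note can be sketched with plain `asyncio`. This is an illustrative simulation, not the mellea orchestrator: `fast_validator`, `slow_validator`, and `emit_with_validation` are hypothetical names, and the sleeps stand in for real validation work such as an LLM call.

```python
# Hedged sketch: a chunk reaches the consumer only after *all*
# validators return for it, so per-chunk latency is bounded below by
# the slowest validator. Names here are illustrative, not mellea's API.
import asyncio
import time


async def fast_validator(chunk: str) -> str:
    await asyncio.sleep(0.01)  # cheap, local check
    return "pass"


async def slow_validator(chunk: str) -> str:
    await asyncio.sleep(0.2)  # stands in for an LLM-based check
    return "pass"


async def emit_with_validation(chunks, validators):
    emitted = []
    for chunk in chunks:
        # Run validators concurrently, but emit only once every one of
        # them has returned for this chunk (the v1 invariant).
        results = await asyncio.gather(*(v(chunk) for v in validators))
        if all(r == "pass" for r in results):
            emitted.append(chunk)
    return emitted


start = time.monotonic()
out = asyncio.run(
    emit_with_validation(["a", "b"], [fast_validator, slow_validator])
)
elapsed = time.monotonic() - start
print(out)             # ['a', 'b']
print(elapsed >= 0.4)  # True: two chunks, each gated on the 0.2 s validator
```

Even though the validators run concurrently per chunk, each emission waits on the slowest one, which is exactly the trade the Note describes.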
