
feat(stdlib): streaming event types, events() iterator, and OTEL bridge (#902) #958

Draft
planetf1 wants to merge 24 commits into generative-computing:main from planetf1:feat/902-streaming-events


@planetf1
Contributor

@planetf1 planetf1 commented Apr 29, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Part of epic #891 · Wave 4 of 4 (final wave).

Adds the streaming event-type vocabulary, wires the OTEL bridge into the orchestrator, replaces the MelleaLogger.warning stopgap with ErrorEvent, and ships all narrative docs for the streaming validation epic.

What changed

  • mellea/stdlib/streaming.py — eight typed event dataclasses (StreamEvent base + ChunkEvent, QuickCheckEvent, StreamingDoneEvent, FullValidationEvent, RetryEvent, CompletedEvent, ErrorEvent) with auto-populated timestamp. StreamChunkingResult.events() single-consumer async iterator backed by an independent queue — can be consumed concurrently with astream(). Event emission wired into _orchestrate_streaming at every lifecycle point. Orchestrator wrapped in trace_application("stream_with_chunking") for an OTEL application span across the full orchestration lifetime. Metric helper calls at each emission point (record_requirement_check, record_requirement_failure, record_sampling_outcome, record_error). ErrorEvent replaces the MelleaLogger.warning stopgap in the except Exception branch. QuickCheckEvent.passed reflects per-chunk result (not cumulative state). Early-exit paths use set_span_status_error rather than a phantom RuntimeError to mark the OTEL span as failed. result.completed = False set before any event is enqueued in the exception path.

  • mellea/telemetry/tracing.py — new set_span_status_error() helper: sets ERROR status on a span without recording a phantom exception event (use this for validation failures where no exception was actually raised).

  • mellea/stdlib/__init__.py — exports all eight event types.

  • docs/docs/how-to/use-async-and-streaming.md — new "Streaming with per-chunk validation" section: leads with events() as the primary API with a full match dispatch example including case _: pass fallback; astream() shown as a secondary alternative for raw-chunk access; stream_validate tri-state table.

  • docs/docs/concepts/requirements-system.md — new "Streaming validation" section: stream_validate() semantics, tri-state PartialValidationResult, state isolation via per-clone copy, cross-link to how-to page.

  • docs/examples/streaming/streaming_chunking.py — updated to use events() API with match dispatch on all six emitted event types; case _: pass fallback for RetryEvent and future types.

Testing

  • Tests added to the respective file if code was changed
    • test/stdlib/test_streaming.py — 12 new tests (28 total, all pass): event timestamp construction, event emission order on happy path (including intra-pair QuickCheck-before-Chunk ordering), StreamingDoneEvent.full_text content, early-exit event sequence, ErrorEvent on stream_validate exception, record_requirement_check call count, record_requirement_failure args, record_sampling_outcome on success and on early-exit fail, concurrent astream() + events() consumption, no-requirements path omits FullValidationEvent, single-consumer guard raises on second drain.
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

Stacked PR

Prerequisite: #942 (Wave 3 — stream_with_chunking(), StreamChunkingResult skeleton, sentence chunker flush).

This PR stacks on feat/901-stream-with-chunking. The diff against main includes Wave 3 changes from that PR. When reviewing, focus on the eleven commits added here:

bd2a5aed fix(stdlib): address fourth-round review findings on streaming events (#902)
c6d896e6 fix(stdlib): address third-round review feedback (CancelledError finally guard, telemetry pop race fix, super().__init__() in test doubles, e2e marker, ValueError test)
b93474b2 docs: add word, paragraph, and custom chunking examples (#902)
0537272c docs: add case _ fallback to streaming_chunking example match block (#902)
ab8dc45f fix(stdlib): address code review feedback on streaming events (#902)
e4035e51 docs(stdlib): fix event dataclass docstrings to use Args not Attributes (#902)
240cd481 docs: add streaming/ to examples catalogue index (#902)
b8c11267 docs: update streaming_chunking example to events() API (#902)
9291d5e7 docs: add streaming validation sections to how-to and concepts (#902)
0f205778 test(stdlib): add event emission and OTEL bridge tests (#902)
a13d58b1 feat(stdlib): add streaming event types, events() iterator, and OTEL bridge (#902)

What to ignore (Wave 3, covered by #942)

  • mellea/core/base.py: cancel_generation addition
  • mellea/stdlib/chunking.py: flush() implementation and flush override overloads
  • Wave-3 commits in mellea/stdlib/streaming.py: _orchestrate_streaming orchestrator, StreamChunkingResult class skeleton, stream_with_chunking() entry point
  • test/stdlib/test_streaming.py: the 16 orchestrator tests from Wave 3

planetf1 added 17 commits April 28, 2026 20:32
Adds an async cancel_generation() method that cancels in-progress
_generate and _generate_extra tasks, drains the internal async queue
to release any blocked put() calls, closes the open telemetry span,
and sets _computed=True so the MOT is immediately usable.

Required by the stream_with_chunking() orchestrator (generative-computing#901) for clean
early-exit when a streaming requirement returns "fail".
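
As a rough illustration of the cancel-then-drain idea (not the actual cancel_generation() implementation, whose internals are not shown on this page), the sketch below cancels a producer that is blocked on a full asyncio.Queue and then empties the queue. The names `producer`, `cancel_and_drain`, and `demo` are hypothetical, and telemetry span handling is omitted.

```python
import asyncio


async def producer(queue: asyncio.Queue) -> None:
    # Stand-in for a backend generation task feeding a bounded queue.
    i = 0
    while True:
        await queue.put(f"tok{i}")  # blocks once the queue is full
        i += 1


async def cancel_and_drain(task: asyncio.Task, queue: asyncio.Queue) -> int:
    """Cancel the producer, wait for it to finish, then drain leftover items."""
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the producer was cancelled on purpose
    drained = 0
    while not queue.empty():
        queue.get_nowait()
        drained += 1
    return drained


async def demo() -> tuple[bool, int, bool]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=3)
    task = asyncio.create_task(producer(queue))
    await asyncio.sleep(0.01)  # let the producer fill the queue and block on put()
    was_full = queue.full()
    drained = await cancel_and_drain(task, queue)
    return was_full, drained, task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```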

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…enerative-computing#901)

Adds stream_with_chunking() — the core streaming orchestration
primitive that consumes a ModelOutputThunk's async stream via a
background asyncio.Task, applies a ChunkingStrategy to produce
semantic chunks, and runs stream_validate() in parallel across all
requirements at each chunk boundary.

Key behaviours:
- Early exit: if any requirement returns "fail" during streaming,
  generation is cancelled immediately via cancel_generation() and
  StreamChunkingResult.completed is set to False.
- Final validation: after natural completion, validate() is called on
  all non-failed requirements.
- Clone-per-call: requirements are cloned (copy(req)) before each
  invocation; originals are never mutated.
- String aliases: "sentence", "word", "paragraph" map to the
  corresponding ChunkingStrategy subclasses.

StreamChunkingResult exposes:
- astream() — async iterator yielding individual validated chunks
- acomplete() — await full completion including final validation
- as_thunk — wrap full_text as a computed ModelOutputThunk
- completed, full_text, final_validations, streaming_failures

Re-exports StreamChunkingResult and stream_with_chunking from
mellea.stdlib for day-to-day use.
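
The control flow described above (accumulate tokens, split into chunks, validate each new chunk, stop at the first "fail") can be sketched in a few lines. This is a simplified stand-in, assuming a single synchronous validator and a toy word splitter; the real orchestrator runs in a background task, uses a ChunkingStrategy, validates across multiple requirements, and calls cancel_generation(), none of which is modelled here.

```python
import asyncio
from typing import AsyncIterator, Callable


async def fake_stream(text: str) -> AsyncIterator[str]:
    # Stand-in for a model's token stream: yields one character at a time.
    for ch in text:
        await asyncio.sleep(0)
        yield ch


def split_words(accumulated: str) -> list[str]:
    """Return complete words; a trailing partial word is withheld."""
    words = accumulated.split()
    if accumulated and not accumulated[-1].isspace() and words:
        words = words[:-1]
    return words


async def orchestrate(
    text: str, validate: Callable[[str], str]
) -> tuple[list[str], bool]:
    """Validate each newly completed chunk; stop at the first 'fail'."""
    accumulated, emitted, seen = "", [], 0
    async for token in fake_stream(text):
        accumulated += token
        for chunk in split_words(accumulated)[seen:]:
            seen += 1
            if validate(chunk) == "fail":
                return emitted, False  # early exit: the failing chunk is not emitted
            emitted.append(chunk)
    return emitted, True
```

The second element of the returned tuple plays the role of the `completed` flag: it is False only when a validator stopped the stream early.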

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Adds test/stdlib/test_streaming.py with 9 unit tests covering:
- Normal completion: validate() called at stream end, completed=True
- Early exit on "fail": completed=False, streaming_failures populated
- Clone isolation: originals never mutated across retries
- quick_check_backend routing: validation uses alternate backend
- Deadlock prevention: early exit with asyncio.wait_for timeout
- as_thunk correctness: value=full_text, raises before acomplete()
- astream() yields individual chunks (not accumulated text)
- No requirements: streams without validation

StreamingMockBackend subclasses Backend and feeds a fixed response
string into a MOT queue char-by-char via asyncio.create_task,
following the create_manual_mock_thunk() pattern from test_astream_mock.py.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Adds docs/examples/streaming/streaming_chunking.py demonstrating
stream_with_chunking() end-to-end: defining a custom stream_validate()
override, consuming chunks via astream(), and awaiting acomplete() to
inspect final_validations and streaming_failures.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Fixes [no_class_args] CI failure — the docs build-and-validate checker
requires __init__ parameters to be documented in the class docstring (not
__init__) per Option C convention.

Assisted-by: Claude Code
Fixes second [no_raises] CI failure — stream_with_chunking raises
ValueError for unknown chunking aliases but had no Raises: section.

Assisted-by: Claude Code
…e call

Aligns the orchestrator with the chunk-at-a-time design set out in the
generative-computing#891 epic and generative-computing#900 spec. Previously _orchestrate_streaming passed the
full accumulated text to stream_validate, and called it once per batch
of new chunks rather than once per chunk. This masked the design intent
of the chunking strategy and forced stateful requirements into the
self._seen_len workaround.

Behaviour changes:
- stream_validate is called once per complete chunk produced by the
  chunking strategy (not once per astream() iteration)
- The call receives that single chunk (not the accumulated text)
- Multiple chunks from one astream() iteration are validated in order;
  early exit on a "fail" prevents later chunks in the same batch from
  being validated or emitted
- On early exit, the failing chunk is no longer emitted to the consumer;
  consumers inspect StreamChunkingResult.streaming_failures instead
  (previous behaviour emitted whatever the current batch contained)

Test changes:
- FailAfterWordsReq now maintains a running word count on self, since
  each stream_validate call sees a one-word chunk rather than the
  growing accumulation
- New test_stream_validate_receives_individual_chunks asserts the
  per-chunk contract directly by capturing the cloned requirement and
  checking the chunks it saw

Docstring updated to describe the per-chunk contract, the in-order
validation of a batch, the non-emission of failing chunks, and the MOT
single-consumer constraint.
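
Under the per-chunk contract, a requirement that needs context beyond one chunk has to carry that state itself. The sketch below shows the pattern with a hypothetical `MaxWordsReq` class; the synchronous `stream_validate(chunk)` signature and the "unknown" result string are assumptions for illustration (only "fail" and "pass" appear in this PR), and the real Requirement base class is not used.

```python
from copy import copy


class MaxWordsReq:
    """Hypothetical requirement: fail once more than `limit` one-word chunks arrive."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._count = 0  # running state, since each call sees only one chunk

    def stream_validate(self, chunk: str) -> str:
        self._count += 1
        # "unknown" is an assumed name for the undecided third state.
        return "fail" if self._count > self.limit else "unknown"


def run(req: MaxWordsReq, chunks: list[str]) -> list[str]:
    clone = copy(req)  # validate on a clone so the original is never mutated
    return [clone.stream_validate(c) for c in chunks]
```

Because the counter lives on the clone, calling `run` twice with the same original starts from zero each time.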

Assisted-by: Claude Code
Two documentation fixes following the per-chunk semantics correction:

- streaming_chunking.py: MaxSentencesReq previously counted sentence-end
  punctuation in the chunk, which worked under the old accumulated-text
  behaviour but returns at most 1 per sentence under delta semantics.
  Rewritten to increment self._count once per chunk -- the canonical
  pattern for a requirement that needs context beyond a single chunk.

- stream_with_chunking docstring: add a Note that chunks are emitted to
  the consumer only after every active validator returns for that chunk.
  A slow stream_validate (e.g. an LLM-based one) therefore adds latency
  to every chunk. The invariant preserved is that the consumer never
  sees unvalidated content; a concurrent-emission fast path may be added
  in future if a concrete use case calls for it.

Assisted-by: Claude Code
ChunkingStrategy.split() withholds the trailing fragment by design
(generative-computing#899). Previously the orchestrator discarded it — it appeared in
full_text and the final validate() saw it, but it was never yielded to
astream() consumers and never seen by stream_validate. For a response
that did not end in a chunk terminator (e.g. "Sentence one. Sentence
two." with no trailing whitespace under SentenceChunker), the last
sentence silently bypassed streaming validation.

Adds ChunkingStrategy.flush(accumulated_text) -> list[str]:
- Default in the ABC returns [] (backward-compatible — external
  chunkers retain the old discard behaviour until they opt in).
- SentenceChunker, WordChunker, ParagraphChunker each override to
  return the withheld trailing fragment as a single-element list.

_orchestrate_streaming calls chunking.flush(accumulated) after the main
loop (only when the stream ended naturally, not on early exit — a
cancelled stream's trailing fragment is by definition incomplete).
Each flushed chunk goes through the same stream_validate / emit path
as regular chunks, so the "no unvalidated content reaches the consumer"
invariant extends to the trailing fragment, and a fail on the fragment
still records a streaming failure and skips final validate().
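
A toy version of the split()/flush() pair makes the withheld-fragment behaviour concrete. This is an illustrative sketch only: `ToySentenceChunker` and its regex are assumptions, not the SentenceChunker in mellea/stdlib/chunking.py, and the `split(accumulated_text)` signature is assumed to mirror the `flush(accumulated_text) -> list[str]` signature given above.

```python
import re


class ToySentenceChunker:
    """Illustrative chunker: a sentence is complete once followed by whitespace."""

    _boundary = re.compile(r"(?<=[.!?])\s+")

    def split(self, accumulated_text: str) -> list[str]:
        parts = self._boundary.split(accumulated_text)
        # The last part has no terminator plus whitespace after it yet, so withhold it.
        return [p for p in parts[:-1] if p]

    def flush(self, accumulated_text: str) -> list[str]:
        # Called once the stream has ended: return the withheld tail, if any.
        tail = self._boundary.split(accumulated_text)[-1].rstrip()
        return [tail] if tail else []
```

For the input "Sentence one. Sentence two." the toy split() returns only the first sentence, and flush() is what hands the second one to validation.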

Tests:
- 13 new chunker tests covering the default-discard behaviour and each
  built-in's flush logic (empty input, fragment-present, already-
  terminated cases).
- test_trailing_fragment_is_flushed_to_consumer: stream_validate sees
  the fragment and astream yields it.
- test_early_exit_on_trailing_fragment: fail on the flushed fragment
  propagates to streaming_failures and skips final validation.

Assisted-by: Claude Code
Addresses issues raised by independent review on top of PR generative-computing#942.

Orchestrator (mellea/stdlib/streaming.py):
- except Exception now calls mot.cancel_generation() before surfacing
  the exception to the consumer — previously the backend producer was
  left running, eventually blocking on mot._async_queue (maxsize=20).
  Cleanup failures are logged via MelleaLogger.warning with a
  TODO(generative-computing#902) marker; generative-computing#902 replaces the log with a proper ErrorEvent.
- RuntimeError catch in the astream() loop now re-raises unless
  mot.is_computed() is true, so only the documented "already computed"
  race is swallowed.
- astream() docstring now states the single-consumer contract
  explicitly; a second iteration blocks on an empty queue with no
  sentinel to deliver.
- as_thunk docstring now flags the early-exit case: cancel_generation
  forces is_computed=True without running post_processing(), so
  generation.usage and related telemetry fields may be None.

Chunker (mellea/stdlib/chunking.py):
- SentenceChunker.flush switches from .strip() to .rstrip() with a
  comment explaining why: the loop's lstrip has already removed
  leading whitespace, and trailing whitespace on a sentence fragment
  is non-semantic (consistent with split() returning sentences
  without trailing whitespace).
- ParagraphChunker.flush adds a docstring noting the deliberate
  asymmetry: paragraph fragments are returned byte-for-byte because
  internal whitespace (e.g. trailing \n of a list item) can be
  semantically meaningful.

Tests (test/stdlib/test_streaming.py):
- test_stream_validate_receives_individual_chunks now uses exact-
  match on the captured chunk list, which directly regresses if
  someone reverts to accumulated-text semantics.
- test_multiple_chunks_in_one_batch_with_mid_batch_fail: response
  fed as one large token so split() yields 4 sentences at once;
  verifies chunk 1 emits, chunk 2 fails (not emitted), chunks 3 and
  4 are neither validated nor emitted.
- test_cancel_generation_invoked_on_fail: spies on
  ModelOutputThunk.cancel_generation and asserts it was called on
  the "fail" early-exit path.
- test_exception_in_stream_validate_cancels_generation: a requirement
  that raises must cause cancel_generation to run and the exception
  to surface via astream()/acomplete() without hanging.

Telemetry observability (orchestrator-level spans, metrics, span
events) remains deferred to generative-computing#902 per the epic, which now has the
acceptance criteria updated to cover event emission, the OTEL bridge,
and the ErrorEvent type that will replace the MelleaLogger stopgap.

Assisted-by: Claude Code
Three items from the second independent review:

cancel_generation(error=) — accept an optional Exception parameter.
When the orchestrator enters the except Exception path, it now passes
the caught exception to cancel_generation() so the backend telemetry
span records the real cause via set_span_error instead of a generic
RuntimeError("Generation cancelled"). The original exception still
surfaces to the consumer via astream()/acomplete(); this is purely an
OTEL accuracy fix. Backward-compatible: the default None preserves the
previous "Generation cancelled" message for the normal fail path.

stream_with_chunking docstring — the "After the stream ends (naturally
or via early exit), validate() is called" wording overstated behaviour.
The orchestrator actually skips final validate() on early exit
(test_early_exit_on_fail verifies final_validations == []). Docstring
now correctly says final validate() runs only on natural completion.

test_exception_in_stream_validate_cancels_generation docstring — the
test fails on chunk 1 so the queue never actually fills; it verifies
the cancel-on-exception path and the no-hang guarantee but does not
directly prove the worst-case "producer blocked on full queue"
scenario. Docstring now states what it actually covers and points at
test/core/ for the cancel_generation drain logic.

Assisted-by: Claude Code
The Docs CI docstring quality gate [no_class_args]-equivalent check
requires every documented method with typed params to have an Args
section and a Returns section matching the return annotation.

SentenceChunker.flush, WordChunker.flush, and ParagraphChunker.flush
all took accumulated_text and returned list[str] without the sections.
Add both to each override, documenting each flush's specific semantics
(rstrip for sentences, whitespace-split trailing fragment for words,
byte-for-byte for paragraphs).

Assisted-by: Claude Code
- _orchestrate_streaming: add cancel_generation() in finally block so the
  backend producer is stopped even on external CancelledError (BaseException
  bypasses except Exception, leaving _generate hung on a full queue)
- cancel_generation: replace .get + del on _telemetry_span with .pop to
  prevent KeyError if two coroutines race before _computed is set
- Example and test doubles: add super().__init__() to Requirement subclasses
  so description/validation_fn/_output are always initialised
- docs/examples: fix pytest tier marker integration → e2e (Ollama example
  must be e2e per MARKERS_GUIDE; all peer examples use e2e)
- test_quick_check_backend_routing: capture clone via __copy__ intercept and
  assert all seen_backends are val_backend, not just clone-isolation check
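
The reason cleanup belongs in a finally block can be shown with plain asyncio: since Python 3.8, asyncio.CancelledError derives from BaseException, so an `except Exception` handler does not see it. The `worker` and `demo` names below are illustrative and unrelated to the orchestrator's real code.

```python
import asyncio


async def worker(log: list[str]) -> None:
    try:
        await asyncio.sleep(10)  # stand-in for the streaming loop
    except Exception:
        log.append("except Exception")  # not reached on cancellation
    finally:
        log.append("finally")  # cleanup still runs here


async def demo() -> list[str]:
    log: list[str] = []
    task = asyncio.create_task(worker(log))
    await asyncio.sleep(0)  # let the task start before cancelling it
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The `asyncio.sleep(0)` matters: a task cancelled before it has started never enters its try block, so its finally clause does not run either.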

Assisted-by: Claude Code
…bridge (generative-computing#902)

Adds eight typed event dataclasses (StreamEvent base + ChunkEvent,
QuickCheckEvent, StreamingDoneEvent, FullValidationEvent, RetryEvent,
CompletedEvent, ErrorEvent) with auto-populated timestamps.

Wires event emission into _orchestrate_streaming:
- ChunkEvent emitted per chunk passed to the consumer
- QuickCheckEvent after each stream_validate batch (pass or fail)
- StreamingDoneEvent when the raw stream ends naturally
- FullValidationEvent after the post-stream validate() calls complete
- ErrorEvent replaces the MelleaLogger.warning stopgap in the except
  branch (removes TODO(generative-computing#902) marker)
- CompletedEvent in the finally block, guaranteed on every exit path

Adds StreamChunkingResult.events() single-consumer async iterator backed
by an independent queue — can be consumed concurrently with astream().

Wraps the orchestrator in trace_application("stream_with_chunking") to
open an OTEL application span for the full orchestration lifetime. Calls
record_requirement_check, record_requirement_failure, record_sampling_outcome,
and record_error at the appropriate emission points. Uses set_span_error on
early-exit fail and on unhandled exceptions.

Exports all eight event types from mellea.stdlib.__init__.
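
The "independent queue" design can be pictured as one producer writing to two queues that are drained by separate consumers. The sketch below is an assumption-laden illustration, not the StreamChunkingResult implementation: events are plain tuples, the sentinel object and function names are made up for the example, and both queues are unbounded.

```python
import asyncio

_DONE = object()  # sentinel marking end of stream


async def produce(
    chunks: list[str], chunk_q: asyncio.Queue, event_q: asyncio.Queue
) -> None:
    for chunk in chunks:
        await event_q.put(("quick_check", chunk))
        await event_q.put(("chunk", chunk))
        await chunk_q.put(chunk)
    await event_q.put(("completed", None))
    await chunk_q.put(_DONE)
    await event_q.put(_DONE)


async def drain(queue: asyncio.Queue) -> list:
    items = []
    while (item := await queue.get()) is not _DONE:
        items.append(item)
    return items


async def demo() -> tuple[list, list]:
    chunk_q: asyncio.Queue = asyncio.Queue()
    event_q: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(produce(["One.", "Two."], chunk_q, event_q))
    # Each consumer reads only its own queue, so neither starves the other.
    chunks, events = await asyncio.gather(drain(chunk_q), drain(event_q))
    await producer
    return chunks, events
```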

Assisted-by: Claude Code
…mputing#902)

Ten new tests for the Wave 4 additions:

- test_stream_event_types_have_auto_timestamp: all seven event types
  auto-populate timestamp on construction
- test_event_emission_order_happy_path: full sequence (ChunkEvent,
  QuickCheckEvent, StreamingDoneEvent, FullValidationEvent,
  CompletedEvent) in correct order on a two-sentence generation
- test_streaming_done_event_carries_full_text: StreamingDoneEvent.full_text
  matches result.full_text
- test_event_emission_on_early_exit: no StreamingDoneEvent or
  FullValidationEvent, QuickCheckEvent(passed=False) present,
  CompletedEvent(success=False)
- test_error_event_on_stream_validate_exception: ErrorEvent emitted with
  correct exception_type and detail, no log warning
- test_record_requirement_check_called_per_chunk: metric helper called
  once per sentence chunk
- test_record_requirement_failure_called_on_fail: called with requirement
  class name and reason string
- test_record_sampling_outcome_success: called with "stream_with_chunking"
  and success=True
- test_record_sampling_outcome_failure_on_early_exit: called with success=False
- test_concurrent_astream_and_events: astream() and events() consumed
  concurrently via asyncio.gather without interference

26 tests total, all passing.

Assisted-by: Claude Code
…ative-computing#902)

use-async-and-streaming.md: new "Streaming with per-chunk validation"
section covering stream_with_chunking() motivation, a minimal example
with MaxSentencesReq, the stream_validate tri-state table, and both
consumption patterns (astream() and events()) with a match dispatch
example. Notes single-consumer discipline and concurrent usage.

requirements-system.md: new "Streaming validation" section explaining
stream_validate() as the streaming counterpart to validate(), the
tri-state PartialValidationResult semantics, state isolation via
per-clone copy, and a cross-link to the how-to page.

Assisted-by: Claude Code
…omputing#902)

Replaces the astream() chunk loop with an events() loop using structural
pattern matching. Shows all six emitted event types: ChunkEvent,
QuickCheckEvent (pass and fail variants), StreamingDoneEvent,
FullValidationEvent, and CompletedEvent.

Updates the module docstring to describe the events() consumption pattern.

Assisted-by: Claude Code
github-actions bot added the enhancement (New feature or request) label Apr 29, 2026
…g#902)

The streaming/ directory (introduced in Wave 3) was missing from
docs/docs/examples/index.md, causing the CI examples-catalogue check
to fail. Add an entry under Core concepts alongside async/.

Assisted-by: Claude Code
…es (generative-computing#902)

The docstring quality gate (--fail-on-quality) requires Args: sections
in class docstrings for constructor parameters. Dataclass fields are
constructor parameters, so they need Args:, not Attributes:.

The seven event subclasses (ChunkEvent, QuickCheckEvent, StreamingDoneEvent,
FullValidationEvent, RetryEvent, CompletedEvent, ErrorEvent) previously used
Attributes: which the auditor could not resolve to __init__ params.

StreamEvent keeps Attributes: for `timestamp` because it is init=False
and does not appear as a constructor parameter.
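
The effect of `init=False` on constructor parameters can be checked directly with the standard dataclasses module. The classes below are stand-ins that reuse names from this PR (`exception_type` and `detail` follow the test description); they are not the shipped definitions.

```python
import time
from dataclasses import dataclass, field, fields


@dataclass
class StreamEvent:
    # init=False keeps timestamp out of __init__, so its default does not
    # force defaults onto fields that subclasses declare afterwards.
    timestamp: float = field(default_factory=time.time, init=False)


@dataclass
class ErrorEvent(StreamEvent):
    exception_type: str  # required field, declared after the defaulted base field
    detail: str


def init_params(cls) -> list[str]:
    """Names of the dataclass fields that are accepted by __init__."""
    return [f.name for f in fields(cls) if f.init]
```

Here `init_params(ErrorEvent)` lists only `exception_type` and `detail`, which is why a docstring checker that maps Args: entries to constructor parameters would not expect `timestamp` there.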

Assisted-by: Claude Code
…tive-computing#902)

- Fix QuickCheckEvent.passed to reflect per-chunk result (was using
  cumulative failed_indices set, causing false negatives on all chunks
  after the first failure)
- Replace synthetic RuntimeError objects in early-exit set_span_error
  calls with set_span_status_error helper (no phantom exception events
  in OTEL traces); add set_span_status_error to mellea/telemetry/tracing.py
- Reorder result.completed = False to top of except block so the flag
  is set before ErrorEvent is enqueued (consistent consumer observation)
- Update acomplete() Raises: docstring to reflect that Exception types
  surface via astream(), only BaseException propagates directly
- Add events() docstring note that events() itself never raises
- Add _event_queue comment noting unconditional production / opt-in consumption
- Add StreamEvent docstring note for subclassers on init=False fields
- Add RetryEvent "not emitted in v1" comment in __init__.__all__
- Fix test: move import time to module level in test_streaming.py
- Add docstring to test_unknown_chunking_alias_raises_value_error
- Rewrite how-to streaming section to lead with events() as primary API;
  demote astream()-only example to secondary; add case _: pass fallback
  to all match event: blocks

Assisted-by: Claude Code
…enerative-computing#902)

Consistent with the how-to doc; covers RetryEvent and any future types.

Assisted-by: Claude Code
…omputing#902)

Three new streaming examples alongside the existing streaming_chunking.py:

- word_chunking.py: WordChunker alias — forbidden-word detection at the
  finest granularity; O(1) set check per token, early exit on first bad word
- paragraph_chunking.py: ParagraphChunker alias — per-paragraph word-count
  gate; validates entire \n\n-delimited blocks, useful for structure/length
  checks that require full paragraph context
- custom_chunking.py: ChunkingStrategy subclass — LineChunker splitting on
  single \n; validates numbered-list output line-by-line; demonstrates
  split()+flush() extension pattern

All three verified running against granite4:micro (Ollama local).

Assisted-by: Claude Code
…lly guard, telemetry pop race fix, super().__init__() in test doubles, e2e marker, ValueError test)

- Set result.completed=False in finally block before cancel_generation() so
  external CancelledError (BaseException, bypasses except Exception) does not
  leave result.completed=True and emit a misleading CompletedEvent/metric
- Add regression test test_cancelled_task_sets_completed_false (27th test);
  documents Python 3.12 C Task cancellation-before-start behaviour and the
  asyncio.sleep(0) scheduling requirement
- Document O(n) re-scan cost in ChunkingStrategy class docstring and split()
  Args; note copy()-cloning constraint for stateful subclasses

Assisted-by: Claude Code
…generative-computing#902)

- Fix misleading StreamEvent docstring (init=False ordering explanation)
- Fix events() docstring: QuickCheckEvent fires before ChunkEvent, not after
- Add _events_consumed guard to events() for single-consumer enforcement
- Move StreamingDoneEvent emission to before flush loop (token stream is done
  regardless of flush validation outcome)
- Guard FullValidationEvent/final_validations list with list() copy to prevent
  aliasing between result attribute and event payload
- Add cancelled-task guard in acomplete() to avoid CancelledError from
  task.exception() on externally-cancelled tasks
- Switch terminal finally bookkeeping to put_nowait() to eliminate await points
  and guarantee _done.set() runs under pending CancelledError
- Add mot.is_computed() guard in except block to avoid double-cancel
- Remove inline comment from __all__ RetryEvent entry
- Fix word_chunking.py example: preserve original-case word list for LLM prompt
- Add test for no-requirements path omitting FullValidationEvent
- Fix test: assert QuickCheckEvent precedes ChunkEvent within each pair

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1
Contributor Author

The PR failure is a known issue with the HF/intrinsics changes (one fix is here).

@planetf1
Contributor Author

@nrfulton are you able to take a look? I have another PR stacked behind this one, with the final batch of changes.


Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(stdlib): add standard streaming event types

1 participant