
feat: context compaction strategies for the react loop #996

Open
yelkurdi wants to merge 4 commits into generative-computing:main from yelkurdi:context_compaction_for_react

Conversation

@yelkurdi
Contributor

@yelkurdi yelkurdi commented May 1, 2026

Component PR

Use this template when adding or modifying components in mellea/stdlib/components/.

Description

  • Link to Issue: Fixes

Implementation Checklist

Protocol Compliance

  • parts() returns list of constituent parts (Components or CBlocks)
  • format_for_llm() returns TemplateRepresentation or string
  • _parse(computed: ModelOutputThunk) parses model output correctly into the specified Component return type

Content Blocks

  • CBlock used appropriately for text content
  • ImageBlock used for image content (if applicable)

Integration

  • Component exported in mellea/stdlib/components/__init__.py or, if you are adding a library of components, from your sub-module

Testing

  • Tests added to tests/components/
  • New code has 100% coverage
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation once the rest of the PR is populated)

Attribution

  • AI coding assistants used

Summary

Adds an optional CompactionStrategy to mellea.stdlib.frameworks.react with three concrete implementations (ClearAll, KeepLastN, LLMSummarize) under a new module mellea.stdlib.compaction. Strategies fire when the running context's token count crosses a configurable threshold, measured from the provider-reported usage on the last ModelOutputThunk.

Empirically on the BCP benchmark with Granite 4.1-8b, llm_summarize cuts inference cost by 23.7% and raises accuracy by 3.5 pp — compaction is a dual win, not a quality/cost trade-off.

Backwards compatible: compaction=None (default) preserves existing react() behavior exactly.
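
For reviewers skimming the API, a minimal call-site sketch follows. Import paths come from the Summary above (a later commit in this PR moves the module under mellea/stdlib/frameworks/), and the non-compaction arguments stand in for whatever react() already accepts; only the compaction parameter is new here.

from mellea.stdlib.compaction import LLMSummarize
from mellea.stdlib.frameworks.react import react

async def run_agent(goal, backend):
    # Hypothetical usage: summarize once the last call's total_tokens crosses
    # 50K, keeping the 5 most recent body components verbatim.
    strategy = LLMSummarize(keep_n=5, threshold=50_000)
    answer, final_context = await react(
        goal,
        backend=backend,       # any backend that populates ModelOutputThunk.usage
        compaction=strategy,   # default None preserves today's behavior exactly
        # ...remaining react() arguments elided
    )
    return answer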

Motivation

Long agentic loops — especially retrieval-heavy ones — pile up tool responses. Each react iteration re-sends the full history to the model, so prompt-token cost grows quadratically and the loop can exhaust the model's context window before reaching a final answer. Compaction trims that history, lowering both the dollar cost of inference and the likelihood of hitting the context / timeout wall.

On the BrowseCompPlus (BCP) benchmark with Granite 4.1-8b (131K context, 830 questions, loop_budget=400, per_question_timeout_s=1800, averaged across 3 runs per strategy):

Compaction      Est. cost @ 80% cache hit   Accuracy           Mean wall-clock per Q
none            $1801.0M                    15.6%              724 s
clear_all       $1499.2M (−16.8%)           18.5% (+2.9 pp)    980 s
llm_summarize   $1373.9M (−23.7%)           19.1% (+3.5 pp)    709 s
  • Without compaction, inference cost on the 830-Q set is ~$1801M under an 80%-cache-hit pricing model, because each react iteration re-sends an ever-larger prompt (history + tool outputs) — the prompt-token total balloons to 5.4B before the 1800 s wall / context limit stops further progress.
  • llm_summarize is the clear winner: it cuts inference cost by −23.7% ($1801M → $1374M) and raises accuracy from 15.6% to 19.1% (+3.5 pp, +23% relative), at similar mean wall-clock. The summary shrinks the prompt enough that each iteration is actually faster, so both cost and quality improve.
  • clear_all reduces cost by −16.8% and lifts accuracy +2.9 pp, but its +35% wall-clock makes it less attractive than llm_summarize when the judge budget allows the extra summarization call.

Hardware / infrastructure for the table above:

  • Model: ibm-granite/granite-4.1-8b (bf16, native 131K context)
  • Node: single 8× NVIDIA H100 80 GiB host (IBM LSF cluster, exclusive GPU allocation)
  • Inference: vLLM 0.19.1, 7 instances × TP=1 (1 GPU each), with --enable-prefix-caching on (Granite-family default). GPU 0 reserved for the BCP local search service (tevatron + BM25 over the BCP corpus).
  • Agent: Mellea react loop via OpenAIBackend → local vLLM, concurrency=56 (8 × num_vllm), loop_budget=400, per_question_timeout_s=1800. Threshold: --compaction_threshold 50000 tokens (~38% of 131K), --compaction_keep_n 5 where applicable.
  • Questions: all 830 from the BCP parquet, decrypted at load.
  • Judge: openai/gpt-oss-120b on the same node after the agent phase teardown. "Correct" = judge verdict matches the reference answer.
  • 3 runs per strategy, independently seeded; numbers in the table are means.

Measurement was done with a separate BCP eval harness (forthcoming PR); the harness data is included here as motivating evidence — the compaction feature itself has no external runtime dependency.

Design

Protocol — CompactionStrategy

class CompactionStrategy(abc.ABC):
    def __init__(self, *, threshold: int = 0) -> None: ...
    def should_compact(self, context: ChatContext) -> bool: ...
    async def maybe_compact(
        self, context, *, backend=None, goal=None
    ) -> ChatContext: ...
    @abc.abstractmethod
    async def compact(
        self, context, *, backend=None, goal=None
    ) -> ChatContext: ...

Threshold is compared against total_tokens from the most recent ModelOutputThunk.usage — i.e. the prompt+completion of the last LLM call.
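
To make the firing condition concrete, here is a rough sketch of that check, written as free functions for brevity (the PR defines should_compact as a method); the iteration over the context and the dict-shaped usage are assumptions, not the actual implementation.

def _last_usage_tokens(context):
    # Scan newest-to-oldest for the most recent item carrying provider-reported
    # usage; return its total_tokens, or None when no such thunk exists.
    for item in reversed(list(context)):  # assumed: the context iterates over its components
        usage = getattr(item, "usage", None)
        if usage and usage.get("total_tokens") is not None:
            return usage["total_tokens"]
    return None

def should_compact(context, threshold: int) -> bool:
    # threshold=0 disables compaction; a missing usage reading means "do not fire".
    tokens = _last_usage_tokens(context)
    return threshold > 0 and tokens is not None and tokens >= threshold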

Concrete strategies

  • ClearAll(threshold) — discard everything after the first ReactInitiator, keeping only the system prefix. Cheapest, most aggressive; model must rebuild context each cycle.
  • KeepLastN(keep_n, threshold) — retain prefix + the last keep_n body components. Middle-ground; preserves recent tool outputs. A minimal sketch follows this list.
  • LLMSummarize(keep_n, threshold) — summarize the older body components via an additional LLM call, keep the last keep_n verbatim. Highest-fidelity, most expensive per-fire; empirically the best for Granite 4.1 on BCP (see table).
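
As promised above, a minimal KeepLastN sketch under stated assumptions: the import paths, the list-style access to the context, and the rebuild call are illustrative and may differ from the actual code.

# Import paths assumed from the PR text; actual module layout may differ.
from mellea.stdlib.compaction import CompactionStrategy
from mellea.stdlib.frameworks.react import ReactInitiator

class KeepLastN(CompactionStrategy):
    """Keep the prefix (through the first ReactInitiator) plus the last keep_n body items."""

    def __init__(self, keep_n: int, *, threshold: int = 0) -> None:
        super().__init__(threshold=threshold)
        self.keep_n = keep_n

    async def compact(self, context, *, backend=None, goal=None):
        items = list(context)  # assumed: context exposes its components in order
        # Everything up to and including the first ReactInitiator is the
        # protected prefix; the remainder is the compactable body.
        split = next(
            (i + 1 for i, item in enumerate(items) if isinstance(item, ReactInitiator)), 0
        )
        prefix, body = items[:split], items[split:]
        kept = body[-self.keep_n:] if self.keep_n > 0 else []
        # Assumed rebuild; the real ChatContext API may expose a different constructor.
        return type(context)(prefix + kept)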

Integration site (react.py, +13 lines)

async def react(
    ..., compaction: CompactionStrategy | None = None,
) -> tuple[ComputedModelOutputThunk[str], ChatContext]:
    ...
    while turn_num < loop_budget or loop_budget == -1:
        step, next_context = await mfuncs.aact(...)
        ...
        if is_final:
            return step, context

        # Compact AFTER the final-answer check so terminal turns skip it.
        if compaction is not None:
            context = await compaction.maybe_compact(
                context, backend=backend, goal=goal
            )

Design decision — compact after the is_final check: a terminal turn has no next iteration to benefit from compaction, and for LLMSummarize this saves a full LLM call per question that would otherwise be wasted. Flagged inline so future refactors don't regress it.

Non-goals

  • No changes to mellea/core/ — the feature is purely additive under mellea/stdlib/, reusing the existing ChatContext + ModelOutputThunk surfaces.

Test plan

  • uv run pytest test/stdlib/test_compaction.py — 26 tests, 11 s locally, no external services required

The test file uses DummyThunk with synthetic usage dicts — no real backend needed; a simplified sketch follows the list below. Coverage includes:

  • Each strategy's compact() output shape
  • Token-count comparison (below, at, above threshold)
  • threshold=0 disables compaction
  • Empty-context / no-thunk-with-usage returns False
  • Prefix preservation (first ReactInitiator never dropped)
  • LLMSummarize error handling when backend/goal omitted
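
The sketch below shows the flavor of these tests, not the real fixtures: DummyThunk is simplified, make_context is a hypothetical helper, and the async test assumes pytest-asyncio is installed.

import pytest

class DummyThunk:
    """Stand-in for ModelOutputThunk that only carries provider-reported usage."""
    def __init__(self, total_tokens: int) -> None:
        self.usage = {"total_tokens": total_tokens}

def test_threshold_zero_disables_compaction():
    ctx = make_context([DummyThunk(total_tokens=10**9)])  # hypothetical fixture helper
    assert not KeepLastN(keep_n=5, threshold=0).should_compact(ctx)

@pytest.mark.asyncio
async def test_keep_last_n_fires_above_threshold():
    ctx = make_context([DummyThunk(total_tokens=60_000)])
    strategy = KeepLastN(keep_n=5, threshold=50_000)
    assert strategy.should_compact(ctx)
    compacted = await strategy.maybe_compact(ctx)
    assert len(list(compacted)) <= len(list(ctx))  # never grows; prefix preserved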

Pitfalls (flagged here so reviewers know what to watch for)

  1. Backends that don't populate mot.usage silently disable compaction. All mainline Mellea backends set it (OpenAI, HF, Ollama, LiteLLM), and AGENTS.md §5 codifies this as a requirement. If a new backend lands without usage population, its users will see compaction become a no-op with no loud error.

  2. One-turn lag in the token-count measurement. The count reflects the prompt+completion of the LLM call that just completed — tool responses appended since are not yet counted. In practice negligible (a typical tool response is <5K tokens relative to a 50K+ threshold). Becomes visible only if a single tool response is very large (e.g. a raw document dump). Documented in the compaction.py docstring.

  3. _last_usage_tokens returns None before the first model call. should_compact then returns False — no-op, not an error. Matters for strategies with very low thresholds where the first LLM call itself crosses the bar.

  4. LLMSummarize needs backend + goal at compact() time. These are forwarded from the react call site. If a user constructs LLMSummarize and calls compact() directly (outside react), they must pass both. The docstring says so; reviewers may prefer a runtime check.

  5. Strategies that drop all body components can leave a thunk-less context. On the next check, _last_usage_tokens returns None and compaction correctly doesn't re-fire immediately — behavior we want, but worth verifying didn't regress.

  6. Token-count threshold is absolute, not a percentage of max context. A possible enhancement is to express the threshold as a percentage of the model's max context length rather than an absolute token count. This would require backends to reliably report that limit (e.g. 131K for Granite 4.1), after which --compaction_threshold 0.5 would read as "fire at 50% of context".

  7. Compaction firing point is after the is_final check, not before. Easy to accidentally swap in a future refactor. Code comment calls this out; ideally a regression test guards it, but the current tests focus on the strategies themselves rather than react-loop placement. Worth adding if the reviewer flags it.

Files

 mellea/stdlib/compaction.py
 mellea/stdlib/frameworks/react.py
 test/stdlib/test_compaction.py
 1 file modified, 2 added

yelkurdi and others added 4 commits April 30, 2026 12:24
Adds CompactionStrategy abstraction and KeepLastN implementation to
mellea/stdlib/compaction.py, wires an optional compaction parameter into
the react() loop, and adds full test coverage in test/stdlib/test_compaction.py.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Switches `CompactionStrategy.threshold` from a component-count trigger to a
token-count trigger, read from the most recent `ModelOutputThunk.usage`
populated by the backend. This aligns compaction with the real constraint
(context size) and sidesteps per-backend tokenizer dependencies by using
provider-reported usage; the trade-off is a one-turn lag since usage is
recorded at the end of each model call.

Also reorders the react loop so compaction runs after the final-answer
check, skipping wasted work (and a wasted LLM call for LLMSummarize) on
terminal turns.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Move the compaction strategies alongside the react framework they serve:
- mellea/stdlib/compaction.py -> mellea/stdlib/frameworks/react_compaction.py
- test/stdlib/test_compaction.py -> test/stdlib/frameworks/test_react_compaction.py

Imports and module docstrings updated accordingly.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
@yelkurdi yelkurdi requested a review from a team as a code owner May 1, 2026 19:59
@yelkurdi yelkurdi requested review from markstur and nrfulton May 1, 2026 19:59
@github-actions github-actions bot added the enhancement (New feature or request) label May 1, 2026
@yelkurdi
Contributor Author

yelkurdi commented May 1, 2026

please add @ramon-astudillo as an observer

@psschwei
Member

psschwei commented May 4, 2026

(original PR body)


@github-actions
Contributor

github-actions Bot commented May 4, 2026

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@psschwei
Member

psschwei commented May 4, 2026

@yelkurdi I updated the PR body so that the update-pr-body check would pass (and copied your original body into a comment). Now that the check is passing, feel free to re-edit to include your original comments in the appropriate section.
