
feat: context compaction strategies for the react loop #996

Open
yelkurdi wants to merge 4 commits into generative-computing:main from yelkurdi:context_compaction_for_react

Conversation

@yelkurdi
Contributor

@yelkurdi yelkurdi commented May 1, 2026

Component PR

Use this template when adding or modifying components in mellea/stdlib/components/.

Description

  • Link to Issue: Fixes

Implementation Checklist

Protocol Compliance

  • parts() returns list of constituent parts (Components or CBlocks)
  • format_for_llm() returns TemplateRepresentation or string
  • _parse(computed: ModelOutputThunk) parses model output correctly into the specified Component return type

Content Blocks

  • CBlock used appropriately for text content
  • ImageBlock used for image content (if applicable)

Integration

  • Component exported in mellea/stdlib/components/__init__.py or, if you are adding a library of components, from your sub-module

Testing

  • Tests added to tests/components/
  • New code has 100% coverage
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation once the rest of the PR is populated)

Attribution

  • AI coding assistants used

Summary

Adds an optional CompactionStrategy to mellea.stdlib.frameworks.react with three concrete implementations (ClearAll, KeepLastN, LLMSummarize) under a new module mellea.stdlib.compaction. Strategies fire when the running context's token count crosses a configurable threshold, measured from the provider-reported usage on the last ModelOutputThunk.

Empirically on the BCP benchmark with Granite 4.1-8b, llm_summarize cuts inference cost by 23.7% and raises accuracy by 3.5 pp — compaction is a dual win, not a quality/cost trade-off.

Backwards compatible: compaction=None (default) preserves existing react() behavior exactly.
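
For reviewers skimming the API, a minimal call-site sketch follows. Import paths come from the Summary above (a later commit in this PR moves the module under mellea/stdlib/frameworks/), and the non-compaction arguments stand in for whatever react() already accepts; only the compaction parameter is new here.

from mellea.stdlib.compaction import LLMSummarize
from mellea.stdlib.frameworks.react import react

async def run_agent(goal, backend):
    # Hypothetical usage: summarize once the last call's total_tokens crosses
    # 50K, keeping the 5 most recent body components verbatim.
    strategy = LLMSummarize(keep_n=5, threshold=50_000)
    answer, final_context = await react(
        goal,
        backend=backend,       # any backend that populates ModelOutputThunk.usage
        compaction=strategy,   # default None preserves today's behavior exactly
        # ...remaining react() arguments elided
    )
    return answer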

Motivation

Long agentic loops — especially retrieval-heavy ones — pile up tool responses. Each react iteration re-sends the full history to the model, so prompt-token cost grows quadratically and the loop can exhaust the model's context window before reaching a final answer. Compaction trims that history, lowering both the dollar cost of inference and the likelihood of hitting the context / timeout wall.

On the BrowseCompPlus (BCP) benchmark with Granite 4.1-8b (131K context, 830 questions, loop_budget=400, per_question_timeout_s=1800, averaged across 3 runs per strategy):

Compaction      Est. cost @ 80% cache hit   Accuracy           Mean wall-clock per Q
none            $1801.0M                    15.6%              724 s
clear_all       $1499.2M (−16.8%)           18.5% (+2.9 pp)    980 s
llm_summarize   $1373.9M (−23.7%)           19.1% (+3.5 pp)    709 s
  • Without compaction, inference cost on the 830-Q set is ~$1801M under an 80%-cache-hit pricing model, because each react iteration re-sends an ever-larger prompt (history + tool outputs) — the prompt-token total balloons to 5.4B before the 1800 s wall / context limit stops further progress.
  • llm_summarize is the clear winner: it cuts inference cost by −23.7% ($1801M → $1374M) and raises accuracy from 15.6% to 19.1% (+3.5 pp, +23% relative), at similar mean wall-clock. The summary shrinks the prompt enough that each iteration is actually faster, so both cost and quality improve.
  • clear_all reduces cost by −16.8% and lifts accuracy +2.9 pp, but its +35% wall-clock makes it less attractive than llm_summarize when the judge budget allows the extra summarization call.

Hardware / infrastructure for the table above:

  • Model: ibm-granite/granite-4.1-8b (bf16, native 131K context)
  • Node: single 8× NVIDIA H100 80 GiB host (IBM LSF cluster, exclusive GPU allocation)
  • Inference: vLLM 0.19.1, 7 instances × TP=1 (1 GPU each), with --enable-prefix-caching on (Granite-family default). GPU 0 reserved for the BCP local search service (tevatron + BM25 over the BCP corpus).
  • Agent: Mellea react loop via OpenAIBackend → local vLLM, concurrency=56 (8 × num_vllm), loop_budget=400, per_question_timeout_s=1800. Threshold: --compaction_threshold 50000 tokens (~38% of 131K), --compaction_keep_n 5 where applicable.
  • Questions: all 830 from the BCP parquet, decrypted at load.
  • Judge: openai/gpt-oss-120b on the same node after the agent phase teardown. "Correct" = judge verdict matches the reference answer.
  • 3 runs per strategy, independently seeded; numbers in the table are means.

Measurement was done with a separate BCP eval harness (forthcoming PR); the harness data is included here as motivating evidence — the compaction feature itself has no external runtime dependency.

Design

Protocol — CompactionStrategy

class CompactionStrategy(abc.ABC):
    def __init__(self, *, threshold: int = 0) -> None: ...
    def should_compact(self, context: ChatContext) -> bool: ...
    async def maybe_compact(
        self, context, *, backend=None, goal=None
    ) -> ChatContext: ...
    @abc.abstractmethod
    async def compact(
        self, context, *, backend=None, goal=None
    ) -> ChatContext: ...

Threshold is compared against total_tokens from the most recent ModelOutputThunk.usage — i.e. the prompt+completion of the last LLM call.
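
To make the firing condition concrete, here is a rough sketch of that check, written as free functions for brevity (the PR defines should_compact as a method); the iteration over the context and the dict-shaped usage are assumptions, not the actual implementation.

def _last_usage_tokens(context):
    # Scan newest-to-oldest for the most recent item carrying provider-reported
    # usage; return its total_tokens, or None when no such thunk exists.
    for item in reversed(list(context)):  # assumed: the context iterates over its components
        usage = getattr(item, "usage", None)
        if usage and usage.get("total_tokens") is not None:
            return usage["total_tokens"]
    return None

def should_compact(context, threshold: int) -> bool:
    # threshold=0 disables compaction; a missing usage reading means "do not fire".
    tokens = _last_usage_tokens(context)
    return threshold > 0 and tokens is not None and tokens >= threshold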

Concrete strategies

  • ClearAll(threshold) — discard everything after the first ReactInitiator, keeping only the system prefix. Cheapest, most aggressive; model must rebuild context each cycle.
  • KeepLastN(keep_n, threshold) — retain prefix + the last keep_n body components. Middle-ground; preserves recent tool outputs. A minimal sketch follows this list.
  • LLMSummarize(keep_n, threshold) — summarize the older body components via an additional LLM call, keep the last keep_n verbatim. Highest-fidelity, most expensive per-fire; empirically the best for Granite 4.1 on BCP (see table).
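
As promised above, a minimal KeepLastN sketch under stated assumptions: the import paths, the list-style access to the context, and the rebuild call are illustrative and may differ from the actual code.

# Import paths assumed from the PR text; actual module layout may differ.
from mellea.stdlib.compaction import CompactionStrategy
from mellea.stdlib.frameworks.react import ReactInitiator

class KeepLastN(CompactionStrategy):
    """Keep the prefix (through the first ReactInitiator) plus the last keep_n body items."""

    def __init__(self, keep_n: int, *, threshold: int = 0) -> None:
        super().__init__(threshold=threshold)
        self.keep_n = keep_n

    async def compact(self, context, *, backend=None, goal=None):
        items = list(context)  # assumed: context exposes its components in order
        # Everything up to and including the first ReactInitiator is the
        # protected prefix; the remainder is the compactable body.
        split = next(
            (i + 1 for i, item in enumerate(items) if isinstance(item, ReactInitiator)), 0
        )
        prefix, body = items[:split], items[split:]
        kept = body[-self.keep_n:] if self.keep_n > 0 else []
        # Assumed rebuild; the real ChatContext API may expose a different constructor.
        return type(context)(prefix + kept)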

Integration site (react.py, +13 lines)

async def react(
    ..., compaction: CompactionStrategy | None = None,
) -> tuple[ComputedModelOutputThunk[str], ChatContext]:
    ...
    while turn_num < loop_budget or loop_budget == -1:
        step, next_context = await mfuncs.aact(...)
        ...
        if is_final:
            return step, context

        # Compact AFTER the final-answer check so terminal turns skip it.
        if compaction is not None:
            context = await compaction.maybe_compact(
                context, backend=backend, goal=goal
            )

Design decision — compact after the is_final check: a terminal turn has no next iteration to benefit from compaction, and for LLMSummarize this saves a full LLM call per question that would otherwise be wasted. Flagged inline so future refactors don't regress it.

Non-goals

  • No changes to mellea/core/ — the feature is purely additive under mellea/stdlib/, reusing the existing ChatContext + ModelOutputThunk surfaces.

Test plan

  • uv run pytest test/stdlib/test_compaction.py — 26 tests, 11 s locally, no external services required

The test file uses DummyThunk with synthetic usage dicts — no real backend needed; a simplified sketch follows the list below. Coverage includes:

  • Each strategy's compact() output shape
  • Token-count comparison (below, at, above threshold)
  • threshold=0 disables compaction
  • Empty-context / no-thunk-with-usage returns False
  • Prefix preservation (first ReactInitiator never dropped)
  • LLMSummarize error handling when backend/goal omitted
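
The sketch below shows the flavor of these tests, not the real fixtures: DummyThunk is simplified, make_context is a hypothetical helper, and the async test assumes pytest-asyncio is installed.

import pytest

class DummyThunk:
    """Stand-in for ModelOutputThunk that only carries provider-reported usage."""
    def __init__(self, total_tokens: int) -> None:
        self.usage = {"total_tokens": total_tokens}

def test_threshold_zero_disables_compaction():
    ctx = make_context([DummyThunk(total_tokens=10**9)])  # hypothetical fixture helper
    assert not KeepLastN(keep_n=5, threshold=0).should_compact(ctx)

@pytest.mark.asyncio
async def test_keep_last_n_fires_above_threshold():
    ctx = make_context([DummyThunk(total_tokens=60_000)])
    strategy = KeepLastN(keep_n=5, threshold=50_000)
    assert strategy.should_compact(ctx)
    compacted = await strategy.maybe_compact(ctx)
    assert len(list(compacted)) <= len(list(ctx))  # never grows; prefix preserved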

Pitfalls (flagged here so reviewers know what to watch for)

  1. Backends that don't populate mot.usage silently disable compaction. All mainline Mellea backends set it (OpenAI, HF, Ollama, LiteLLM), and AGENTS.md §5 codifies this as a requirement. If a new backend lands without usage population, its users will see compaction become a no-op with no loud error.

  2. One-turn lag in the token-count measurement. The count reflects the prompt+completion of the LLM call that just completed — tool responses appended since are not yet counted. In practice negligible (a typical tool response is <5K tokens relative to a 50K+ threshold). Becomes visible only if a single tool response is very large (e.g. a raw document dump). Documented in the compaction.py docstring.

  3. _last_usage_tokens returns None before the first model call. should_compact then returns False — no-op, not an error. Matters for strategies with very low thresholds where the first LLM call itself crosses the bar.

  4. LLMSummarize needs backend + goal at compact() time. These are forwarded from the react call site. If a user constructs LLMSummarize and calls compact() directly (outside react), they must pass both. The docstring says so; reviewers may prefer a runtime check.

  5. Strategies that drop all body components can leave a thunk-less context. On the next check, _last_usage_tokens returns None and compaction correctly doesn't re-fire immediately — behavior we want, but worth verifying didn't regress.

  6. Token-count threshold is absolute, not a percentage of max context. A possible enhancement is to express the threshold as a percentage of the model's max context length rather than an absolute token count. This would require backends to reliably report that limit (e.g. 131K for Granite 4.1), after which --compaction_threshold 0.5 would read as "fire at 50% of context".

  7. Compaction firing point is after the is_final check, not before. Easy to accidentally swap in a future refactor. Code comment calls this out; ideally a regression test guards it, but the current tests focus on the strategies themselves rather than react-loop placement. Worth adding if the reviewer flags it.

Files

 mellea/stdlib/compaction.py
 mellea/stdlib/frameworks/react.py
 test/stdlib/test_compaction.py
 1 file modified, 2 added

yelkurdi and others added 4 commits April 30, 2026 12:24
Adds CompactionStrategy abstraction and KeepLastN implementation to
mellea/stdlib/compaction.py, wires an optional compaction parameter into
the react() loop, and adds full test coverage in test/stdlib/test_compaction.py.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Switches `CompactionStrategy.threshold` from a component-count trigger to a
token-count trigger, read from the most recent `ModelOutputThunk.usage`
populated by the backend. This aligns compaction with the real constraint
(context size) and sidesteps per-backend tokenizer dependencies by using
provider-reported usage; the trade-off is a one-turn lag since usage is
recorded at the end of each model call.

Also reorders the react loop so compaction runs after the final-answer
check, skipping wasted work (and a wasted LLM call for LLMSummarize) on
terminal turns.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Move the compaction strategies alongside the react framework they serve:
- mellea/stdlib/compaction.py -> mellea/stdlib/frameworks/react_compaction.py
- test/stdlib/test_compaction.py -> test/stdlib/frameworks/test_react_compaction.py

Imports and module docstrings updated accordingly.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
@yelkurdi yelkurdi requested a review from a team as a code owner May 1, 2026 19:59
@yelkurdi yelkurdi requested review from markstur and nrfulton May 1, 2026 19:59
@github-actions github-actions bot added the enhancement (New feature or request) label May 1, 2026
@yelkurdi
Contributor Author

yelkurdi commented May 1, 2026

please add @ramon-astudillo as an observer

@psschwei
Member

psschwei commented May 4, 2026

(original PR body)


@github-actions
Contributor

github-actions Bot commented May 4, 2026

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@psschwei
Member

psschwei commented May 4, 2026

@yelkurdi I updated the PR body so that the update-pr-body check would pass (and copied your original body into a comment). Now that the check is passing, feel free to re-edit to include your original comments in the appropriate section.
