feat: context compaction strategies for the react loop #996
yelkurdi wants to merge 4 commits into generative-computing:main from
Conversation
Adds CompactionStrategy abstraction and KeepLastN implementation to mellea/stdlib/compaction.py, wires an optional compaction parameter into the react() loop, and adds full test coverage in test/stdlib/test_compaction.py. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Switches `CompactionStrategy.threshold` from a component-count trigger to a token-count trigger, read from the most recent `ModelOutputThunk.usage` populated by the backend. This aligns compaction with the real constraint (context size) and sidesteps per-backend tokenizer dependencies by using provider-reported usage; the trade-off is a one-turn lag since usage is recorded at the end of each model call. Also reorders the react loop so compaction runs after the final-answer check, skipping wasted work (and a wasted LLM call for LLMSummarize) on terminal turns. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Move the compaction strategies alongside the react framework they serve:
- mellea/stdlib/compaction.py -> mellea/stdlib/frameworks/react_compaction.py
- test/stdlib/test_compaction.py -> test/stdlib/frameworks/test_react_compaction.py
Imports and module docstrings updated accordingly. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
please add @ramon-astudillo as an observer
(original PR body; its content appears in the PR description sections below)
The PR description has been updated. Please fill out the template for your PR to be reviewed.
@yelkurdi I updated the PR body so that the update-pr-body check would pass (and copied your original body into a comment). Now that the check is passing, feel free to re-edit to include your original comments in the appropriate section. |
Component PR
Use this template when adding or modifying components in `mellea/stdlib/components/`.

Description
Implementation Checklist
Protocol Compliance
- `parts()` returns list of constituent parts (Components or CBlocks)
- `format_for_llm()` returns TemplateRepresentation or string
- `_parse(computed: ModelOutputThunk)` parses model output correctly into the specified Component return type

Content Blocks
Integration
- Exported from `mellea/stdlib/components/__init__.py` or, if you are adding a library of components, from your sub-module

Testing
- `tests/components/`

Attribution
Summary
Adds an optional `CompactionStrategy` to `mellea.stdlib.frameworks.react` with three concrete implementations (`ClearAll`, `KeepLastN`, `LLMSummarize`) under a new module `mellea.stdlib.compaction`. Strategies fire when the running context's token count crosses a configurable threshold, measured from the provider-reported `usage` on the last `ModelOutputThunk`.

Empirically on the BCP benchmark with Granite 4.1-8b, `llm_summarize` cuts inference cost by 23.7% and raises accuracy by 3.5 pp — compaction is a dual win, not a quality/cost trade-off.

Backwards compatible: `compaction=None` (default) preserves existing `react()` behavior exactly.
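For concreteness, a minimal usage sketch. The import paths and the `react()` arguments other than `compaction` are assumptions based on this description, not taken from the diff, and a later commit moves the module to `mellea/stdlib/frameworks/react_compaction.py`:

```python
# Sketch only — exact module paths and react() signature may differ from the merged code.
from mellea.stdlib.compaction import KeepLastN
from mellea.stdlib.frameworks.react import react

# Default: compaction=None, identical to the pre-PR react() behavior.
answer = react(goal="When was the cited paper published?", compaction=None)

# Opt in: keep the react prefix plus the last 5 body components once the
# provider-reported usage crosses 50K tokens.
answer = react(
    goal="When was the cited paper published?",
    compaction=KeepLastN(keep_n=5, threshold=50_000),
)
```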
Motivation

Long agentic loops — especially retrieval-heavy ones — pile up tool responses. Each react iteration re-sends the full history to the model, so prompt-token cost grows quadratically and the loop can exhaust the model's context window before reaching a final answer. Compaction trims that history, lowering both the dollar cost of inference and the likelihood of hitting the context / timeout wall.
On the BrowseCompPlus (BCP) benchmark with Granite 4.1-8b (131K context, 830 questions, `loop_budget=400`, `per_question_timeout_s=1800`, averaged across 3 runs per strategy):

(results table: `none` vs `clear_all` vs `llm_summarize`)

`llm_summarize` is the clear winner: it cuts inference cost by 23.7% ($1801M → $1374M) and raises accuracy from 15.6% to 19.1% (+3.5 pp, +23% relative), at similar mean wall-clock. The summary shrinks the prompt enough that each iteration is actually faster, so both cost and quality improve. `clear_all` reduces cost by 16.8% and lifts accuracy +2.9 pp, but its +35% wall-clock makes it less attractive than `llm_summarize` when the judge budget allows the extra summarization call.

Hardware / infrastructure for the table above:
- `ibm-granite/granite-4.1-8b` (bf16, native 131K context), `--enable-prefix-caching` on (Granite-family default). GPU 0 reserved for the BCP local search service (tevatron + BM25 over the BCP corpus).
- `OpenAIBackend` → local vLLM, `concurrency=56` (8 × num_vllm), `loop_budget=400`, `per_question_timeout_s=1800`. Threshold: `--compaction_threshold 50000` tokens (~38% of 131K), `--compaction_keep_n 5` where applicable.
- `openai/gpt-oss-120b` on the same node after the agent phase teardown. "Correct" = judge verdict matches the reference answer.

Measurement was done with a separate BCP eval harness (forthcoming PR); the harness data is included here as motivating evidence — the compaction feature itself has no external runtime dependency.
Design
Protocol — `CompactionStrategy`
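The protocol code block itself did not survive the copy into this page, so here is a rough reconstruction from the surrounding description. The `should_compact`, `compact`, and `_last_usage_tokens` names appear later in this body; the exact signatures and type hints are assumptions:

```python
from typing import Any, Protocol


def _last_usage_tokens(ctx) -> int | None:
    """total_tokens from the most recent ModelOutputThunk.usage, or None before any model call."""
    ...


class CompactionStrategy(Protocol):
    threshold: int  # token-count trigger; threshold=0 disables compaction

    def should_compact(self, ctx) -> bool:
        """True once _last_usage_tokens(ctx) meets or exceeds threshold."""
        ...

    def compact(self, ctx, *, backend: Any = None, goal: Any = None):
        """Return a trimmed context; backend/goal are only required by LLMSummarize."""
        ...
```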
Threshold is compared against `total_tokens` from the most recent `ModelOutputThunk.usage` — i.e. the prompt+completion of the last LLM call.

Concrete strategies
- `ClearAll(threshold)` — discard everything after the first `ReactInitiator`, keeping only the system prefix. Cheapest, most aggressive; model must rebuild context each cycle.
- `KeepLastN(keep_n, threshold)` — retain prefix + the last `keep_n` body components. Middle-ground; preserves recent tool outputs (a rough sketch follows this list).
- `LLMSummarize(keep_n, threshold)` — summarize the older body components via an additional LLM call, keep the last `keep_n` verbatim. Highest-fidelity, most expensive per-fire; empirically the best for Granite 4.1 on BCP (see table).
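As a rough sketch of the middle strategy — not the merged code — reusing the `_last_usage_tokens` sketch above; the `_prefix_end` helper and the flat-list view of the context are illustrative assumptions:

```python
from dataclasses import dataclass


def _prefix_end(components: list) -> int:
    """Index just past the first ReactInitiator, i.e. the end of the system prefix (illustrative)."""
    for i, c in enumerate(components):
        if type(c).__name__ == "ReactInitiator":
            return i + 1
    return 0


@dataclass
class KeepLastN:
    keep_n: int
    threshold: int

    def should_compact(self, ctx) -> bool:
        tokens = _last_usage_tokens(ctx)  # None before the first model call -> no-op
        return bool(self.threshold) and tokens is not None and tokens >= self.threshold

    def compact(self, ctx, **_unused):
        components = list(ctx)
        cut = _prefix_end(components)
        prefix, body = components[:cut], components[cut:]
        kept = body[-self.keep_n:] if self.keep_n > 0 else []
        return prefix + kept  # prefix + the most recent keep_n body components
```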
Integration site (react.py, +13 lines)

Design decision — compact after the is_final check: a terminal turn has no next iteration to benefit from compaction, and for `LLMSummarize` this saves a full LLM call per question that would otherwise be wasted. Flagged inline so future refactors don't regress it.
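Roughly, the hook sits at the bottom of each iteration. The loop below is a paraphrase of that placement, not the actual `react.py` code; the helper and variable names are assumptions:

```python
# Paraphrased loop shape, assuming the strategy is passed in as `compaction`.
while loop_budget_remaining():
    mot = run_model_turn(ctx)          # ModelOutputThunk; the backend fills mot.usage
    if is_final(mot):
        break                          # terminal turn: no next iteration to benefit from compaction
    ctx = append_tool_results(ctx, mot)
    # Deliberately placed *after* the is_final check, so LLMSummarize never
    # spends an extra LLM call on a turn with no next iteration.
    if compaction is not None and compaction.should_compact(ctx):
        ctx = compaction.compact(ctx, backend=backend, goal=goal)
```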
Non-goals

- No changes to `mellea/core/` — the feature is purely additive under `mellea/stdlib/`, reusing the existing `ChatContext` + `ModelOutputThunk` surfaces.

Test plan
- `uv run pytest test/stdlib/test_compaction.py` — 26 tests, 11 s locally, no external services required
- The test file uses `DummyThunk` with synthetic `usage` dicts — no real backend needed (a minimal sketch of the approach follows this list). Coverage includes: `threshold=0` disables compaction
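The shape of those tests, approximately. This `DummyThunk` stand-in only mimics the provider-reported usage dict; the real fixtures and the full coverage live in the test file:

```python
class DummyThunk:
    """Minimal stand-in for ModelOutputThunk carrying only a usage dict."""

    def __init__(self, total_tokens: int):
        self.usage = {"total_tokens": total_tokens}


def test_threshold_zero_disables_compaction():
    strategy = KeepLastN(keep_n=5, threshold=0)
    ctx = [DummyThunk(total_tokens=1_000_000)]  # huge usage, but threshold=0
    assert strategy.should_compact(ctx) is False
```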
Pitfalls (flagged here so reviewers know what to watch for)

- Backends that don't populate `mot.usage` silently disable compaction. All mainline Mellea backends set it (OpenAI, HF, Ollama, LiteLLM), and `AGENTS.md` §5 codifies this as a requirement. If a new backend lands without `usage` population, its users will see compaction become a no-op with no loud error.
- One-turn lag in the token-count measurement. The count reflects the prompt+completion of the LLM call that just completed — tool responses appended since are not yet counted. In practice negligible (a typical tool response is <5K tokens relative to a 50K+ threshold). Becomes visible only if a single tool response is very large (e.g. a raw document dump). Documented in the `compaction.py` docstring.
- `_last_usage_tokens` returns None before the first model call. `should_compact` then returns False — no-op, not an error. Matters for strategies with very low thresholds where the first LLM call itself crosses the bar.
- `LLMSummarize` needs `backend` + `goal` at `compact()` time. These are forwarded from the react call site. If a user constructs `LLMSummarize` and calls `compact()` directly (outside react), they must pass both. The docstring says so; reviewers may prefer a runtime check.
- Strategies that drop all body components can leave a thunk-less context. On the next check, `_last_usage_tokens` returns None and compaction correctly doesn't re-fire immediately — behavior we want, but worth verifying this didn't regress.
- Token-count threshold is absolute, not a percentage of max context. A possible enhancement is to express the threshold as a percentage of the model's max context length rather than an absolute token count. This would require backends to reliably report that limit (e.g. 131K for Granite 4.1), after which `--compaction_threshold 0.5` would read as "fire at 50% of context" (a small sketch of this idea follows the list).
- Compaction firing point is after the is_final check, not before. Easy to accidentally swap in a future refactor. Code comment calls this out; ideally a regression test guards it, but the current tests focus on the strategies themselves rather than react-loop placement. Worth adding if the reviewer flags it.
Files