feat(llm-access): proactive Kiro auto-compaction gate using real token usage#24
feat(llm-access): proactive Kiro auto-compaction gate using real token usage#24acking-you wants to merge 2 commits into
Conversation
…n usage Claude Code (and other Anthropic clients) only compact a conversation when the model's context window is nearly full. Against Kiro's 1M-context models that trigger lands at ~80-92% (~800k-920k tokens), at which point the reactive-compaction summary request — which itself carries the whole history — is so close to the real ceiling that it is slow and frequently fails with `input + max_tokens > limit`, surfacing the terminal "Context limit reached" instead of a clean compaction. This adds a proxy-side gate that returns a `Prompt is too long: N > M` response *before* dispatching upstream once a request crosses a configurable token trigger (default 780k via `LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS`; 0/negative disables). The client then compacts while there is still real headroom, so its summary request fits the window and succeeds quickly. No client changes needed. Accuracy: the gate must not threshold on the local `count_all_tokens` estimate alone — it drifts from Kiro's true contextUsage (bridge scaffolding, tokenizer differences, the 15k/ratio rule in `resolve_input_tokens_with_threshold`). So the real resolved input-token count is now stored per conversation-prefix in the anchor index at `record_success` time and recovered pre-dispatch next turn; the gate uses `max(local_estimate, recovered_real_prev)`. The anchor index is keyed by conversation prefix, so after a compaction the prefix changes, recovery misses, and the gate falls back to the (now small) local count — no deadlock. Compaction-summary requests are always exempt (detected via the summary-instruction signature) so the reactive loop can never gate its own summary request. Changes: - llm-access-kiro/cache_sim/anchor_index: ConversationAnchorEntry gains `real_input_tokens: Option<i32>`; `insert` takes it; `get_real_input_tokens` peeks it without bumping recency (+ 4 tests). - llm-access-kiro/cache_sim/simulator: `record_success_from_runtime_projection` stores the real value; `recover_real_input_tokens_from_runtime_projection` reads it. Test-only `record_success` passes None. - llm-access/provider/errors: `LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS` parser (default 780k), `kiro_proactive_compact_response` builder, limit-parameterized message core (+ 6 tests). - llm-access/provider/kiro_dispatch: websearch-path gate (local estimate) + main-path gate (max(local, recovered real)) after anchor recovery, before the upstream call; `request_is_compaction_summary` detector (+ 5 tests). - llm-access/provider/stream_guards: compute + store the resolved real input tokens at both (stream + non-stream) record-success sites. Verified: cargo clippy -D warnings on llm-access-kiro + llm-access; 168 + 287 tests pass; rustfmt on the 5 changed files; deps submodules untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a proactive auto-compaction gate for Kiro requests to trigger client-side conversation compaction before reaching the model's context window limit. It tracks and retrieves the previous turn's real input-token count using the conversation anchor index, configures a default trigger threshold of 780,000 tokens, and exempts compaction summary requests to prevent deadlocks. The review feedback highlights two key improvement opportunities: optimizing the compaction summary detection to avoid string allocations on large payloads by using an allocation-free case-insensitive search, and replacing duration_since with saturating_duration_since in the token retrieval logic to prevent potential panics from clock drift while also refactoring cache borrows.
| fn text_is_compaction_summary_prompt(text: &str) -> bool { | ||
| let lower = text.to_ascii_lowercase(); | ||
| lower.contains("detailed summary of") && lower.contains("conversation") | ||
| } |
There was a problem hiding this comment.
The text_is_compaction_summary_prompt function is called recursively on every string value in the request payload via json_value_contains_compaction_summary. Calling text.to_ascii_lowercase() on every string block allocates a new String of the same size and copies the data. For large payloads (which can be up to 1M tokens / several megabytes of text), this causes significant CPU overhead and memory churn.
Since the search patterns (detailed summary of and conversation) are purely ASCII, we can perform a case-insensitive substring search without any allocations by using slice::eq_ignore_ascii_case on byte windows.
fn contains_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
if needle.is_empty() {
return true;
}
haystack
.as_bytes()
.windows(needle.len())
.any(|window| window.eq_ignore_ascii_case(needle.as_bytes()))
}
fn text_is_compaction_summary_prompt(text: &str) -> bool {
contains_ignore_ascii_case(text, "detailed summary of")
&& contains_ignore_ascii_case(text, "conversation")
}There was a problem hiding this comment.
Fixed in 3d35b79. Replaced the per-string to_ascii_lowercase() allocation with an allocation-free contains_ignore_ascii_case (byte-window eq_ignore_ascii_case), exactly as suggested. Also tightened the scan to only the active request (last message + system) per the codex P2 below, so the haystack is bounded anyway.
| pub fn get_real_input_tokens( | ||
| &mut self, | ||
| anchor: &str, | ||
| now: Instant, | ||
| ttl: Duration, | ||
| ) -> Option<i32> { | ||
| let cache = self.cache.as_mut()?; | ||
| let entry = cache.peek(anchor)?; | ||
| if now.duration_since(entry.last_touched_at) > ttl { | ||
| cache.pop(anchor); | ||
| return None; | ||
| } | ||
| entry.real_input_tokens | ||
| } |
There was a problem hiding this comment.
The current implementation of get_real_input_tokens performs a peek lookup, borrows the entry, and then conditionally calls cache.pop which requires a mutable borrow of the cache. While NLL (Non-Lexical Lifetimes) can sometimes resolve this, it is safer and more idiomatic to extract the required values first to completely release the borrow on the cache before performing any mutation.
Additionally, using now.duration_since(entry.last_touched_at) can panic if now is slightly earlier than entry.last_touched_at due to subtle monotonic clock drifts or virtualization/testing environments. Using saturating_duration_since is a safer alternative to prevent potential panics.
pub fn get_real_input_tokens(
&mut self,
anchor: &str,
now: Instant,
ttl: Duration,
) -> Option<i32> {
let cache = self.cache.as_mut()?;
let (expired, real_tokens) = match cache.peek(anchor) {
Some(entry) => (
now.saturating_duration_since(entry.last_touched_at) > ttl,
entry.real_input_tokens,
),
None => return None,
};
if expired {
cache.pop(anchor);
None
} else {
real_tokens
}
}There was a problem hiding this comment.
Fixed in 3d35b79. recover_token_counts (renamed from get_real_input_tokens) now extracts (expired, token_counts) from the peek borrow first, releasing it before the conditional pop, and uses saturating_duration_since to avoid the monotonic-clock panic. (The entry now carries both real+local counts for the current-turn-delta fix below.)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab4eb27ddd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| trigger, | ||
| "proactively returning prompt-too-long to trigger client compaction" | ||
| ); | ||
| return kiro_proactive_compact_response(effective_input_tokens, trigger); |
There was a problem hiding this comment.
Record proactive 413s before returning
When a normal Kiro request crosses the proactive compaction trigger, this branch returns a gateway-generated 413 before updating usage_meta or calling any usage recorder. That makes these intentional Prompt is too long responses disappear from the usage/error audit trail, unlike the nearby conversion preflight failures that call record_kiro_preflight_failure and upstream 413s that call record_kiro_usage; the websearch gate at line 246 has the same bypass. Capture the error metadata and record the rejection before returning so operators can see when the new gate fires.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 3d35b79. Both gates (websearch + main path) now capture the error message/body and call record_kiro_preflight_failure before returning the 413, so the proactive rejection appears in the usage/error audit trail like the sibling conversion preflight failures. That recorder is audit-only (zero usage summary + status record, no cooldown/failover), which is the right semantics for an intentional gate.
| let effective_input_tokens = request_input_tokens.max(recovered_real_tokens); | ||
| if effective_input_tokens >= trigger && !request_is_compaction_summary(&payload) { |
There was a problem hiding this comment.
Include the current turn when using recovered real tokens
For conversations where Kiro's real contextUsage is higher than the local estimate, max(request_input_tokens, recovered_real_tokens) can miss the turn that actually crosses the proactive threshold: e.g. a cached previous turn at 760k real tokens plus a new 30k+ user paste may still have a local total below 780k, so this branch dispatches upstream instead of triggering compaction. The stored value represents the previous turn, so the gate needs to add the current-turn delta (or otherwise compare a current real-token estimate), not just take the max, otherwise large follow-up turns can still hit Kiro's hard limit before the proxy fires.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 3d35b79. The anchor now caches both the real and local counts per turn (AnchorTokenCounts), and the gate estimates the current real consumption as real_prev + max(0, local_now - local_prev), gated against max(local_now, estimated_real). So your 760k-real-prev + 30k-new-paste case now estimates ~790k and trips the 780k trigger even though local_now alone is below it. Covered by effective_tokens_adds_current_delta_to_recovered_real.
| payload | ||
| .messages | ||
| .iter() | ||
| .any(|message| json_value_contains_compaction_summary(&message.content)) |
There was a problem hiding this comment.
Limit compaction-summary detection to the active request
Because this scans every historical message, any earlier user/assistant text containing phrases like “detailed summary of ... conversation” marks all subsequent turns as compaction summaries and permanently exempts that conversation from the proactive gate. In that scenario a later large normal turn near the trigger will be sent upstream and can still hit Kiro's hard context failure; the exemption should be based on the current summary instruction (or another precise marker), not arbitrary old history.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 3d35b79. request_is_compaction_summary now scans only the system prompt and the last message, not the whole history, so an earlier turn that quoted "detailed summary of ... conversation" no longer permanently exempts the conversation. Regression test earlier_history_summary_phrase_does_not_exempt_later_turn asserts a later normal turn after such history is NOT exempted.
…onfig Addresses the PR #24 review and migrates the proactive-compaction trigger from an env var to the Postgres-backed AdminRuntimeConfig so it is editable from the frontend admin UI (no redeploy). CR fixes: - (gemini, high) compaction-summary detection no longer allocates a lowercased copy of each string block; `contains_ignore_ascii_case` does an allocation-free `eq_ignore_ascii_case` byte-window search. MB-scale payloads no longer churn memory. - (gemini, med) `recover_token_counts` (was `get_real_input_tokens`) extracts values before the conditional `pop` (clean borrow release) and uses `saturating_duration_since` to avoid a monotonic-clock panic. - (codex, P2) proactive 413s are now recorded via `record_kiro_preflight_failure` (audit-only, no cooldown) on both gates, so the rejection shows in the usage/error trail like sibling preflight failures. - (codex, P2) the gate no longer uses `max(local, recovered_real)` which missed the current turn. The anchor now caches both real and local counts per turn; the gate estimates `real_prev + max(0, local_now - local_prev)` so a large follow-up turn still trips the trigger. - (codex, P2) compaction-summary detection scans only the active request (the last message + system), not all history, so an earlier turn that merely quoted the instruction can no longer permanently exempt the conversation. Config migration (env -> PG -> frontend): - `LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS` env var removed; replaced by `AdminRuntimeConfig.kiro_compact_trigger_tokens` (default 780000, 0 = off), propagated to `ProviderKiroRoute.compact_trigger_tokens` and the Valkey RouteSnapshot. Migration 0019 + 0001_init column (CHECK >= 0). Admin surface validates with `validate_max` (0 allowed, cap 1,000,000). Frontend api.rs + /admin/kiro-gateway gain a "Compaction Trigger Tokens (0 disables)" field. - The gate reads `route.compact_trigger_tokens` at both gate points. Verified: clippy -D warnings on core/store/migrations/kiro/llm-access; frontend wasm check; kiro 168 + core 10 + llm-access 291 tests pass (incl. new anchor/gate/admin tests); rustfmt on the 24 changed files; deps submodules untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CR addressed + trigger moved to PG runtime config (3d35b79)All 5 review findings fixed (replied inline): allocation-free compaction detection (gemini high), Beyond the CR — config now lives in Postgres, editable from the frontend. The The current-turn-delta fix (codex P2) is what made the accuracy work concrete: the anchor caches both the real (contextUsage-derived) and local ( Verified: clippy |
Summary
Anthropic clients (Claude Code) only compact a conversation when the model's context window is nearly full. Against Kiro's 1M-context models that trigger lands at ~80–92% (~800k–920k tokens), and at that point the client's reactive-compaction summary request — which itself carries the whole history — is so close to the real ceiling that it is slow and frequently fails with
input + max_tokens > limit, surfacing the terminalContext limit reachedinstead of a clean compaction.This adds a proxy-side gate that returns a
Prompt is too long: N > Mresponse before dispatching upstream once a request crosses a configurable token trigger. The client then compacts while there's still real headroom, so its summary request fits the window and succeeds quickly. No client changes needed.Root cause
Prompt is too long: <actual> tokens > <limit> tokensis the error shape that drives the client's reactive compaction. Today it only appears once Kiro itself rejects the request near 1M — too late for the summary to fit. Firing it early, at a soft trigger below the real window, is the whole fix.What changed
LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS(default 780000;0/negative disables the gate — the model's real window still applies).count_all_tokensestimate alone — it drifts from Kiro's true contextUsage (bridge scaffolding, tokenizer differences, the 15k/ratio rule inresolve_input_tokens_with_threshold). So the real resolved input-token count is stored per conversation-prefix in the anchor index atrecord_successtime and recovered pre-dispatch next turn; the gate usesmax(local_estimate, recovered_real_prev).Files
llm-access-kiro/cache_sim/anchor_index:ConversationAnchorEntry.real_input_tokens+get_real_input_tokens(peek, no recency bump).llm-access-kiro/cache_sim/simulator: store onrecord_success_from_runtime_projection, read viarecover_real_input_tokens_from_runtime_projection.llm-access/provider/errors: trigger parser +kiro_proactive_compact_responsebuilder + limit-parameterized message core.llm-access/provider/kiro_dispatch: websearch-path gate (local estimate) + main-path gate (max(local, recovered real)) after anchor recovery, before the upstream call;request_is_compaction_summarydetector.llm-access/provider/stream_guards: compute + store resolved real input tokens at both record-success sites.Verification
cargo clippy -p llm-access-kiro -p llm-access --all-targets -- -D warnings→ cleancargo test→ llm-access-kiro 168, llm-access 287 (incl. 15 new: 4 anchor-index, 6 errors-config, 5 compaction-detector)rustfmton the 5 changed files only;deps/lance&deps/lancedbuntouchedNotes for reviewers
LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENSis read once viaLazyLock; documented on theDEFAULT_KIRO_COMPACT_TRIGGER_TOKENSconstant (matching how sibling provider tunables are documented — code-local, not in the runbook).max(...)with the current local estimate keeps the newest delta in play. First turn (no stored value) falls back to local — correct, since early turns are small.🤖 Generated with Claude Code