feat(llm-access): proactive Kiro auto-compaction gate using real token usage by acking-you · Pull Request #24 · acking-you/static_flow

acking-you · 2026-06-01T20:05:02Z

Summary

Anthropic clients (Claude Code) only compact a conversation when the model's context window is nearly full. Against Kiro's 1M-context models that trigger lands at ~80–92% (~800k–920k tokens), and at that point the client's reactive-compaction summary request — which itself carries the whole history — is so close to the real ceiling that it is slow and frequently fails with input + max_tokens > limit, surfacing the terminal Context limit reached instead of a clean compaction.

This adds a proxy-side gate that returns a Prompt is too long: N > M response before dispatching upstream once a request crosses a configurable token trigger. The client then compacts while there's still real headroom, so its summary request fits the window and succeeds quickly. No client changes needed.

Root cause

Prompt is too long: <actual> tokens > <limit> tokens is the error shape that drives the client's reactive compaction. Today it only appears once Kiro itself rejects the request near 1M — too late for the summary to fit. Firing it early, at a soft trigger below the real window, is the whole fix.

What changed

Configurable trigger: LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS (default 780000; 0/negative disables the gate — the model's real window still applies).
Accuracy via the anchor cache (the key design point): the gate must not threshold on the local count_all_tokens estimate alone — it drifts from Kiro's true contextUsage (bridge scaffolding, tokenizer differences, the 15k/ratio rule in resolve_input_tokens_with_threshold). So the real resolved input-token count is stored per conversation-prefix in the anchor index at record_success time and recovered pre-dispatch next turn; the gate uses max(local_estimate, recovered_real_prev).
No compaction deadlock: the anchor index is keyed by conversation prefix — after a compaction the prefix changes, recovery misses, and the gate falls back to the (now small) local count. Compaction-summary requests are additionally exempt (detected via the summary-instruction signature) so the reactive loop can never gate its own summary request.

Files

llm-access-kiro/cache_sim/anchor_index: ConversationAnchorEntry.real_input_tokens + get_real_input_tokens (peek, no recency bump).
llm-access-kiro/cache_sim/simulator: store on record_success_from_runtime_projection, read via recover_real_input_tokens_from_runtime_projection.
llm-access/provider/errors: trigger parser + kiro_proactive_compact_response builder + limit-parameterized message core.
llm-access/provider/kiro_dispatch: websearch-path gate (local estimate) + main-path gate (max(local, recovered real)) after anchor recovery, before the upstream call; request_is_compaction_summary detector.
llm-access/provider/stream_guards: compute + store resolved real input tokens at both record-success sites.

Verification

cargo clippy -p llm-access-kiro -p llm-access --all-targets -- -D warnings → clean
cargo test → llm-access-kiro 168, llm-access 287 (incl. 15 new: 4 anchor-index, 6 errors-config, 5 compaction-detector)
rustfmt on the 5 changed files only; deps/lance & deps/lancedb untouched

Notes for reviewers

LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS is read once via LazyLock; documented on the DEFAULT_KIRO_COMPACT_TRIGGER_TOKENS constant (matching how sibling provider tunables are documented — code-local, not in the runbook).
The recovered value is the previous turn's real consumption; max(...) with the current local estimate keeps the newest delta in play. First turn (no stored value) falls back to local — correct, since early turns are small.

🤖 Generated with Claude Code

…n usage Claude Code (and other Anthropic clients) only compact a conversation when the model's context window is nearly full. Against Kiro's 1M-context models that trigger lands at ~80-92% (~800k-920k tokens), at which point the reactive-compaction summary request — which itself carries the whole history — is so close to the real ceiling that it is slow and frequently fails with `input + max_tokens > limit`, surfacing the terminal "Context limit reached" instead of a clean compaction. This adds a proxy-side gate that returns a `Prompt is too long: N > M` response *before* dispatching upstream once a request crosses a configurable token trigger (default 780k via `LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS`; 0/negative disables). The client then compacts while there is still real headroom, so its summary request fits the window and succeeds quickly. No client changes needed. Accuracy: the gate must not threshold on the local `count_all_tokens` estimate alone — it drifts from Kiro's true contextUsage (bridge scaffolding, tokenizer differences, the 15k/ratio rule in `resolve_input_tokens_with_threshold`). So the real resolved input-token count is now stored per conversation-prefix in the anchor index at `record_success` time and recovered pre-dispatch next turn; the gate uses `max(local_estimate, recovered_real_prev)`. The anchor index is keyed by conversation prefix, so after a compaction the prefix changes, recovery misses, and the gate falls back to the (now small) local count — no deadlock. Compaction-summary requests are always exempt (detected via the summary-instruction signature) so the reactive loop can never gate its own summary request. Changes: - llm-access-kiro/cache_sim/anchor_index: ConversationAnchorEntry gains `real_input_tokens: Option<i32>`; `insert` takes it; `get_real_input_tokens` peeks it without bumping recency (+ 4 tests). - llm-access-kiro/cache_sim/simulator: `record_success_from_runtime_projection` stores the real value; `recover_real_input_tokens_from_runtime_projection` reads it. Test-only `record_success` passes None. - llm-access/provider/errors: `LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS` parser (default 780k), `kiro_proactive_compact_response` builder, limit-parameterized message core (+ 6 tests). - llm-access/provider/kiro_dispatch: websearch-path gate (local estimate) + main-path gate (max(local, recovered real)) after anchor recovery, before the upstream call; `request_is_compaction_summary` detector (+ 5 tests). - llm-access/provider/stream_guards: compute + store the resolved real input tokens at both (stream + non-stream) record-success sites. Verified: cargo clippy -D warnings on llm-access-kiro + llm-access; 168 + 287 tests pass; rustfmt on the 5 changed files; deps submodules untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a proactive auto-compaction gate for Kiro requests to trigger client-side conversation compaction before reaching the model's context window limit. It tracks and retrieves the previous turn's real input-token count using the conversation anchor index, configures a default trigger threshold of 780,000 tokens, and exempts compaction summary requests to prevent deadlocks. The review feedback highlights two key improvement opportunities: optimizing the compaction summary detection to avoid string allocations on large payloads by using an allocation-free case-insensitive search, and replacing duration_since with saturating_duration_since in the token retrieval logic to prevent potential panics from clock drift while also refactoring cache borrows.

gemini-code-assist · 2026-06-01T20:06:50Z

+fn text_is_compaction_summary_prompt(text: &str) -> bool {
+    let lower = text.to_ascii_lowercase();
+    lower.contains("detailed summary of") && lower.contains("conversation")
+}


The text_is_compaction_summary_prompt function is called recursively on every string value in the request payload via json_value_contains_compaction_summary. Calling text.to_ascii_lowercase() on every string block allocates a new String of the same size and copies the data. For large payloads (which can be up to 1M tokens / several megabytes of text), this causes significant CPU overhead and memory churn.

Since the search patterns (detailed summary of and conversation) are purely ASCII, we can perform a case-insensitive substring search without any allocations by using slice::eq_ignore_ascii_case on byte windows.

fn contains_ignore_ascii_case(haystack: &str, needle: &str) -> bool { if needle.is_empty() { return true; } haystack .as_bytes() .windows(needle.len()) .any(|window| window.eq_ignore_ascii_case(needle.as_bytes())) } fn text_is_compaction_summary_prompt(text: &str) -> bool { contains_ignore_ascii_case(text, "detailed summary of") && contains_ignore_ascii_case(text, "conversation") }

Fixed in 3d35b79. Replaced the per-string to_ascii_lowercase() allocation with an allocation-free contains_ignore_ascii_case (byte-window eq_ignore_ascii_case), exactly as suggested. Also tightened the scan to only the active request (last message + system) per the codex P2 below, so the haystack is bounded anyway.

gemini-code-assist · 2026-06-01T20:06:50Z

+    pub fn get_real_input_tokens(
+        &mut self,
+        anchor: &str,
+        now: Instant,
+        ttl: Duration,
+    ) -> Option<i32> {
+        let cache = self.cache.as_mut()?;
+        let entry = cache.peek(anchor)?;
+        if now.duration_since(entry.last_touched_at) > ttl {
+            cache.pop(anchor);
+            return None;
+        }
+        entry.real_input_tokens
+    }


The current implementation of get_real_input_tokens performs a peek lookup, borrows the entry, and then conditionally calls cache.pop which requires a mutable borrow of the cache. While NLL (Non-Lexical Lifetimes) can sometimes resolve this, it is safer and more idiomatic to extract the required values first to completely release the borrow on the cache before performing any mutation.

Additionally, using now.duration_since(entry.last_touched_at) can panic if now is slightly earlier than entry.last_touched_at due to subtle monotonic clock drifts or virtualization/testing environments. Using saturating_duration_since is a safer alternative to prevent potential panics.

pub fn get_real_input_tokens( &mut self, anchor: &str, now: Instant, ttl: Duration, ) -> Option<i32> { let cache = self.cache.as_mut()?; let (expired, real_tokens) = match cache.peek(anchor) { Some(entry) => ( now.saturating_duration_since(entry.last_touched_at) > ttl, entry.real_input_tokens, ), None => return None, }; if expired { cache.pop(anchor); None } else { real_tokens } }

Fixed in 3d35b79. recover_token_counts (renamed from get_real_input_tokens) now extracts (expired, token_counts) from the peek borrow first, releasing it before the conditional pop, and uses saturating_duration_since to avoid the monotonic-clock panic. (The entry now carries both real+local counts for the current-turn-delta fix below.)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab4eb27ddd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-01T20:10:37Z

+                    trigger,
+                    "proactively returning prompt-too-long to trigger client compaction"
+                );
+                return kiro_proactive_compact_response(effective_input_tokens, trigger);


Record proactive 413s before returning

When a normal Kiro request crosses the proactive compaction trigger, this branch returns a gateway-generated 413 before updating usage_meta or calling any usage recorder. That makes these intentional Prompt is too long responses disappear from the usage/error audit trail, unlike the nearby conversion preflight failures that call record_kiro_preflight_failure and upstream 413s that call record_kiro_usage; the websearch gate at line 246 has the same bypass. Capture the error metadata and record the rejection before returning so operators can see when the new gate fires.

Useful? React with 👍 / 👎.

Fixed in 3d35b79. Both gates (websearch + main path) now capture the error message/body and call record_kiro_preflight_failure before returning the 413, so the proactive rejection appears in the usage/error audit trail like the sibling conversion preflight failures. That recorder is audit-only (zero usage summary + status record, no cooldown/failover), which is the right semantics for an intentional gate.

chatgpt-codex-connector · 2026-06-01T20:10:37Z

+            let effective_input_tokens = request_input_tokens.max(recovered_real_tokens);
+            if effective_input_tokens >= trigger && !request_is_compaction_summary(&payload) {


Include the current turn when using recovered real tokens

For conversations where Kiro's real contextUsage is higher than the local estimate, max(request_input_tokens, recovered_real_tokens) can miss the turn that actually crosses the proactive threshold: e.g. a cached previous turn at 760k real tokens plus a new 30k+ user paste may still have a local total below 780k, so this branch dispatches upstream instead of triggering compaction. The stored value represents the previous turn, so the gate needs to add the current-turn delta (or otherwise compare a current real-token estimate), not just take the max, otherwise large follow-up turns can still hit Kiro's hard limit before the proxy fires.

Useful? React with 👍 / 👎.

Fixed in 3d35b79. The anchor now caches both the real and local counts per turn (AnchorTokenCounts), and the gate estimates the current real consumption as real_prev + max(0, local_now - local_prev), gated against max(local_now, estimated_real). So your 760k-real-prev + 30k-new-paste case now estimates ~790k and trips the 780k trigger even though local_now alone is below it. Covered by effective_tokens_adds_current_delta_to_recovered_real.

chatgpt-codex-connector · 2026-06-01T20:10:38Z

+    payload
+        .messages
+        .iter()
+        .any(|message| json_value_contains_compaction_summary(&message.content))


Limit compaction-summary detection to the active request

Because this scans every historical message, any earlier user/assistant text containing phrases like “detailed summary of ... conversation” marks all subsequent turns as compaction summaries and permanently exempts that conversation from the proactive gate. In that scenario a later large normal turn near the trigger will be sent upstream and can still hit Kiro's hard context failure; the exemption should be based on the current summary instruction (or another precise marker), not arbitrary old history.

Useful? React with 👍 / 👎.

Fixed in 3d35b79. request_is_compaction_summary now scans only the system prompt and the last message, not the whole history, so an earlier turn that quoted "detailed summary of ... conversation" no longer permanently exempts the conversation. Regression test earlier_history_summary_phrase_does_not_exempt_later_turn asserts a later normal turn after such history is NOT exempted.

…onfig Addresses the PR #24 review and migrates the proactive-compaction trigger from an env var to the Postgres-backed AdminRuntimeConfig so it is editable from the frontend admin UI (no redeploy). CR fixes: - (gemini, high) compaction-summary detection no longer allocates a lowercased copy of each string block; `contains_ignore_ascii_case` does an allocation-free `eq_ignore_ascii_case` byte-window search. MB-scale payloads no longer churn memory. - (gemini, med) `recover_token_counts` (was `get_real_input_tokens`) extracts values before the conditional `pop` (clean borrow release) and uses `saturating_duration_since` to avoid a monotonic-clock panic. - (codex, P2) proactive 413s are now recorded via `record_kiro_preflight_failure` (audit-only, no cooldown) on both gates, so the rejection shows in the usage/error trail like sibling preflight failures. - (codex, P2) the gate no longer uses `max(local, recovered_real)` which missed the current turn. The anchor now caches both real and local counts per turn; the gate estimates `real_prev + max(0, local_now - local_prev)` so a large follow-up turn still trips the trigger. - (codex, P2) compaction-summary detection scans only the active request (the last message + system), not all history, so an earlier turn that merely quoted the instruction can no longer permanently exempt the conversation. Config migration (env -> PG -> frontend): - `LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS` env var removed; replaced by `AdminRuntimeConfig.kiro_compact_trigger_tokens` (default 780000, 0 = off), propagated to `ProviderKiroRoute.compact_trigger_tokens` and the Valkey RouteSnapshot. Migration 0019 + 0001_init column (CHECK >= 0). Admin surface validates with `validate_max` (0 allowed, cap 1,000,000). Frontend api.rs + /admin/kiro-gateway gain a "Compaction Trigger Tokens (0 disables)" field. - The gate reads `route.compact_trigger_tokens` at both gate points. Verified: clippy -D warnings on core/store/migrations/kiro/llm-access; frontend wasm check; kiro 168 + core 10 + llm-access 291 tests pass (incl. new anchor/gate/admin tests); rustfmt on the 24 changed files; deps submodules untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

acking-you · 2026-06-01T22:29:45Z

CR addressed + trigger moved to PG runtime config (`3d35b79`)

All 5 review findings fixed (replied inline): allocation-free compaction detection (gemini high), recover_token_counts borrow + saturating_duration_since (gemini med), proactive-413 audit recording on both gates (codex P2), current-turn delta in the gate estimate (codex P2), and last-message+system-only detection (codex P2).

Beyond the CR — config now lives in Postgres, editable from the frontend. The LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS env var is gone; the trigger is now AdminRuntimeConfig.kiro_compact_trigger_tokens (default 780000, 0 disables), persisted in llm_runtime_config (migration 0019 + 0001_init column, CHECK >= 0), propagated to ProviderKiroRoute.compact_trigger_tokens + the Valkey RouteSnapshot, and exposed as a "Compaction Trigger Tokens (0 disables)" field on /admin/kiro-gateway. The gate reads route.compact_trigger_tokens at both gate points, so changes take effect without redeploy.

The current-turn-delta fix (codex P2) is what made the accuracy work concrete: the anchor caches both the real (contextUsage-derived) and local (count_all_tokens) counts per turn, and the gate estimates real_prev + max(0, local_now - local_prev).

Verified: clippy -D warnings on core/store/migrations/kiro/llm-access; frontend wasm check; kiro 168 + core 10 + llm-access 291 tests (incl. new anchor/gate/admin tests); rustfmt on the 24 changed files; deps submodules untouched.

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 1, 2026

View reviewed changes

		let effective_input_tokens = request_input_tokens.max(recovered_real_tokens);
		if effective_input_tokens >= trigger && !request_is_compaction_summary(&payload) {

Conversation

acking-you commented Jun 1, 2026

Summary

Root cause

What changed

Files

Verification

Notes for reviewers

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

acking-you Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

acking-you Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

acking-you Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

acking-you Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

acking-you Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

acking-you commented Jun 1, 2026

CR addressed + trigger moved to PG runtime config (3d35b79)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

CR addressed + trigger moved to PG runtime config (`3d35b79`)