
bootstrap: Engage Claude Code auto-compaction for third-party models #70

Merged: brycelelbach merged 1 commit into brycelelbach:main from robobryce:add/auto-compact-window on Jun 7, 2026

@robobryce robobryce commented Jun 4, 2026

Contributor

What

Tag the model name with a [1m] suffix and pin CLAUDE_CODE_AUTO_COMPACT_WINDOW to the model's true context window, in the third-party-deepseek and third-party-nemotron launcher arms:

  • DeepSeek V4 Pro → window 1000000, auto-compaction at ~967K
  • Nemotron 3 Ultra → window 262144 (its true limit), auto-compaction at ~229K
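
As a rough sketch of what the two launcher arms end up exporting: the function name `select_third_party_env` is hypothetical (it is not the name used in bootstrap.bash), while the arm names, model ids, and window sizes are the ones listed above.

```bash
# Hypothetical helper; illustrative only, not the code in bootstrap.bash.
select_third_party_env() {
  case "$1" in
    third-party-deepseek)
      # The "[1m]" tag makes Claude Code resolve a 1M window.
      export ANTHROPIC_MODEL='deepseek-reasoner[1m]'
      export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000
      ;;
    third-party-nemotron)
      # Window pinned to Nemotron's true 262,144-token limit.
      export ANTHROPIC_MODEL='nvidia/nvidia/nemotron-3-ultra[1m]'
      export CLAUDE_CODE_AUTO_COMPACT_WINDOW=262144
      ;;
    *)
      echo "unknown launcher arm: $1" >&2
      return 1
      ;;
  esac
}
```

Calling `select_third_party_env third-party-nemotron` before launching `claude` would leave both variables set in the environment.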

Why

In a local (non-remote) session, Claude Code only fires its auto-compact trigger when the model's context window resolves to a known size. For a model name it doesn't recognize, the window falls back to a 200K default whose source the trigger gate rejects — so auto-compaction never runs and the conversation grows unbounded until the provider rejects the request with a fatal ContextWindowExceededError.

Observed in a real run: a Nemotron 3 Ultra session grew monotonically across 533 turns with zero compaction, then died with:

API Error: 400 litellm.ContextWindowExceededError: ...
"This model's maximum context length is 262144 tokens.
 However, your messages resulted in 268481 tokens."
model=nvidia/nvidia/nemotron-3-ultra
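
To put a number on the overshoot, the two token counts can be pulled out of the error text. This is only an illustrative sketch: the message is copied from the run above, and the `sed` patterns are an assumption about its shape, not a parser for litellm errors in general.

```bash
# Error text from the failed run, joined onto one line.
err="This model's maximum context length is 262144 tokens. However, your messages resulted in 268481 tokens."

# Extract the provider's limit and the size of the rejected request.
limit=$(printf '%s\n' "$err" | sed -n 's/.*maximum context length is \([0-9][0-9]*\) tokens.*/\1/p')
used=$(printf '%s\n' "$err" | sed -n 's/.*resulted in \([0-9][0-9]*\) tokens.*/\1/p')

overshoot=$((used - limit))
echo "request exceeded the limit by $overshoot tokens"   # 6337
```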

How

Two coupled levers, verified against the installed Claude Code 2.1.168 binary and a mock Anthropic endpoint:

  1. [1m] suffix on the model name makes Claude Code resolve the model's window to 1M (tv() matches /[1m]/i). Without it, the window — and therefore the auto-compact ceiling — is clamped to the 200K default regardless of CLAUDE_CODE_AUTO_COMPACT_WINDOW, and the trigger stays disabled. Critically, Claude Code strips [1m] from the model name before the request, so the gateway still receives the real model id. Confirmed against a mock endpoint: ANTHROPIC_MODEL=deepseek-reasoner[1m] sends model=deepseek-reasoner on the wire (and nvidia/nvidia/nemotron-3-ultra[1m] sends nvidia/nvidia/nemotron-3-ultra).

  2. CLAUDE_CODE_AUTO_COMPACT_WINDOW pinned to the true window then sets where compaction fires: window − ~20K (output reserve) − ~13K (buffer), the same formula a first-party model uses. So DeepSeek (1M) compacts at ~967K and Nemotron (262,144) at ~229K — each ~33K below its hard limit.
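
The trigger arithmetic in item 2 can be checked with a few lines of shell. This sketch assumes the reserve and buffer are exactly 20000 and 13000 tokens (the description only gives them as approximate), and `compact_trigger` is a hypothetical name, not a Claude Code function.

```bash
# Assumed exact values; the PR states them only as ~20K and ~13K.
OUTPUT_RESERVE=20000
BUFFER=13000

# compact_trigger WINDOW: token count at which auto-compaction would fire.
compact_trigger() {
  echo $(( $1 - OUTPUT_RESERVE - BUFFER ))
}

compact_trigger 1000000   # DeepSeek V4 Pro: 967000
compact_trigger 262144    # Nemotron 3 Ultra: 229144
```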

This addresses @brycelelbach's review: compaction fires before the window, not at it, with the same margin a 1M-context Opus model gets (which auto-compacts at ~967K, not at 1M). The [1m] tag also attaches the context-1m-2025-08-07 anthropic-beta header; these gateways already receive a stack of beta headers and ignore unknown ones, so it's additive.

Test plan

  • ./test.bash --lint: bash -n + shellcheck clean.
  • ./test.bash --unit — all pass. The deepseek/nemotron launcher tests assert the [1m]-tagged ANTHROPIC_MODEL (model=deepseek-reasoner[1m], model=nvidia/nvidia/nemotron-3-ultra[1m]) and auto_compact_window=1000000 / 262144 in the generated launcher env.
  • ./test.bash --docker — full bootstrap.bash in a fresh ubuntu:22.04; All e2e assertions passed. / === docker e2e passed ===.
  • Wire-level check against a mock Anthropic endpoint — drove the real claude binary with the launcher's env; captured request shows model=deepseek-reasoner (the [1m] suffix stripped) while Claude Code resolves the 1M window. Confirms the gateway receives a valid model id.
  • ./test.bash --secrets — gitleaks, no leaks found (working tree + history).
  • ./test.bash --e2e — covered by --docker (same assertions, isolated container); no host-only behavior in this change.
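
One detail of the unit assertions is worth spelling out: the tagged model name contains `[1m]`, which is a bracket expression in a regular expression, so the checks rely on `grep -F`. The sketch below uses a temporary file as a hypothetical stand-in for the generated launcher env; the key names mirror the ones the tests assert.

```bash
# Hypothetical stand-in for the generated launcher env file.
env_file=$(mktemp)
printf '%s\n' 'model=deepseek-reasoner[1m]' 'auto_compact_window=1000000' > "$env_file"

# -F: fixed string, so "[1m]" is literal; -x: whole-line match; -q: quiet.
grep -Fxq 'model=deepseek-reasoner[1m]' "$env_file" && echo "tagged model present"
grep -Fxq 'auto_compact_window=1000000' "$env_file" && echo "window present"

# Without -F, "[1m]" matches a single "1" or "m", so the literal line
# is not matched and the check would wrongly fail.
grep -xq 'model=deepseek-reasoner[1m]' "$env_file" || echo "regex form does not match"
```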

🤖 Generated with Claude Code

Comment thread tests/bootstrap.bats
grep -Fxq 'auth_token=deepseek-test-key' "$TEST_HOME/claude-launcher-env"
grep -Fxq 'model=deepseek-reasoner' "$TEST_HOME/claude-launcher-env"
grep -Fxq 'debug=1' "$TEST_HOME/claude-launcher-env"
grep -Fxq 'auto_compact_window=1000000' "$TEST_HOME/claude-launcher-env"

Owner

This is wrong. If DeepSeek V4 Pro's context window is 1 million, we shouldn't compact when we reach 1 million, right? We should do it just beforehand.

For 1 million context window Opus models, at what point does Claude Code autocompact? I just want the answer to be consistent.

Contributor Author

Good catch — you're right that compacting at the window would be too late, and chasing this down turned up a second problem with the original values. I verified the behavior against the installed Claude Code 2.1.168 binary (the cnwh$/hE4 path).

For a 1M-context Opus model, auto-compaction fires at ~967,000 tokens, not at 1,000,000. The trigger is window − ~20K (output reserve) − ~13K (buffer). So the value is the window size, not the compaction point — there's automatic headroom. (The docs mention a "~95%" trigger via CLAUDE_AUTOCOMPACT_PCT_OVERRIDE, but that percentage path is off by default; the real default is the fixed ~33K headroom.) Even taken at face value, 1000000 would have compacted at ~967K, not at 1M.

But the bigger issue: the original values never took effect. Claude Code clamps the window to min(tv(model), value), where tv(model) returns 1M only for names matching /[1m]/i, the beta-header path, or first-party opus-4-7/4-8. deepseek-reasoner and nvidia/nvidia/nemotron-3-ultra match none of those, so they fall to the 200K default — meaning 1000000 and 240000 were both silently clamped to a 200K window already.
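
The clamp described here can be modelled in a few lines. This is a simplified sketch, not Claude Code's implementation: `resolve_window` and `effective_window` are hypothetical names, and the beta-header and first-party Opus paths are left out, so only the `[1m]` name match and the 200K default are represented.

```bash
# resolve_window MODEL: 1000000 if the name contains "[1m]" (either case),
# otherwise the 200K default. Simplified stand-in for tv(model).
resolve_window() {
  case "$1" in
    *'[1m]'*|*'[1M]'*) echo 1000000 ;;
    *)                 echo 200000 ;;
  esac
}

# effective_window MODEL VALUE: min(resolved window, configured value).
effective_window() {
  resolved=$(resolve_window "$1")
  if [ "$2" -lt "$resolved" ]; then echo "$2"; else echo "$resolved"; fi
}

effective_window 'deepseek-reasoner' 1000000               # 200000
effective_window 'nvidia/nvidia/nemotron-3-ultra' 240000   # 200000
effective_window 'deepseek-reasoner[1m]' 1000000           # 1000000
```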

So the real job of this env var here isn't to set 1M — it's to give the window an explicit env source, which is what flips the local-session trigger gate from "skip" to "engage." That's the actual fix for the unbounded-growth failure.

I've changed both arms to 200000 — the honest effective ceiling. Compaction now fires at ~167K, well under Nemotron's 262,144 limit, using the same window − reserve − buffer formula a 1M Opus model uses. To genuinely give DeepSeek a ~967K compact point we'd need Claude Code to recognize its true window, which there's no third-party knob for (the [1m] tag is a first-party alias; appending it to a gateway model id would just send a bogus name). Full breakdown is in the updated PR description.

Contributor Author

Yes — and I've raised it to Nemotron's true limit. Pushed an update that changes the approach so the configured window actually takes effect.

The catch I hit before: CLAUDE_CODE_AUTO_COMPACT_WINDOW is clamped to min(model_window, value), and an unrecognized third-party model name resolves to the 200K default — so any value above 200K was silently capped. The fix is to tag the model name with a [1m] suffix, which makes Claude Code resolve the full window. I verified against a mock endpoint that Claude Code strips [1m] before the request, so the gateway still receives the real id (deepseek-reasoner, nvidia/nvidia/nemotron-3-ultra).
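
The suffix handling seen on the wire amounts to dropping a trailing `[1m]` from the configured name. A minimal sketch of that mapping, where `wire_model` is a hypothetical name that only mirrors the mock-endpoint capture and is not how Claude Code implements it:

```bash
# wire_model MODEL: print the model id with a trailing "[1m]" tag removed.
wire_model() {
  stripped=${1%\[1m\]}   # escaped brackets are literal in the pattern
  printf '%s\n' "$stripped"
}

wire_model 'deepseek-reasoner[1m]'                 # deepseek-reasoner
wire_model 'nvidia/nvidia/nemotron-3-ultra[1m]'    # nvidia/nvidia/nemotron-3-ultra
wire_model 'deepseek-reasoner'                     # unchanged
```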

With the clamp lifted, I set each window to the model's true limit:

  • DeepSeek: 1000000 → compaction at ~967K
  • Nemotron: 262144 (its real limit, up from the earlier 200K) → compaction at ~229K

262144 is the principled maximum for Nemotron: compaction fires at window − ~33K (a ~20K output reserve + ~13K buffer — the same margin a first-party model uses), so 262144 lands the trigger at ~229K with a full 33K of headroom under the 262,144 hard limit. Going higher would push the trigger past the limit and risk the ContextWindowExceededError this is meant to prevent.

@robobryce force-pushed the add/auto-compact-window branch from aa20c8d to 320930b on June 7, 2026 17:56
A local (non-remote) Claude Code session only fires the auto-compact
trigger when the model's context window resolves to a known size — and for
a model name Claude Code does not recognize it falls back to a 200K default
whose source the trigger gate rejects. So for third-party models
(deepseek-reasoner, nvidia/nvidia/nemotron-3-ultra) auto-compaction never
runs and the conversation grows until the provider rejects the request with
a fatal ContextWindowExceededError.

Observed in a real run: a Nemotron 3 Ultra session grew monotonically
across 533 turns with zero compaction, then died at ~268K tokens against
its 262,144 limit.

Tag the model name with a "[1m]" suffix in the deepseek and nemotron
launcher arms and pin CLAUDE_CODE_AUTO_COMPACT_WINDOW to the model's true
context window. The "[1m]" suffix makes Claude Code resolve the full window
(without it the window — and therefore the auto-compact ceiling — is capped
at the 200K default) and engages the trigger; Claude Code strips the suffix
from the model name before the request, so the gateway still receives the
real model id. Verified against a mock Anthropic endpoint: with
ANTHROPIC_MODEL=deepseek-reasoner[1m] the wire request carries
model=deepseek-reasoner.

  - DeepSeek V4 Pro -> window 1000000, compaction at ~967K.
  - Nemotron 3 Ultra -> window 262144 (its true limit), compaction at ~229K.

Compaction fires ~33K below the window (a ~20K output reserve plus a ~13K
buffer), the same window-minus-reserve margin a first-party model uses, so
it triggers before either provider's hard limit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@robobryce force-pushed the add/auto-compact-window branch from 320930b to ec4127b on June 7, 2026 18:18
@brycelelbach brycelelbach merged commit 22ba83e into brycelelbach:main Jun 7, 2026
7 checks passed