Skip to content

feat(engine): env override for max_output_tokens on tight providers#2147

Open
h3c-hexin wants to merge 1 commit into
Hmbown:mainfrom
h3c-hexin:feat/max-output-tokens-env
Open

feat(engine): env override for max_output_tokens on tight providers#2147
h3c-hexin wants to merge 1 commit into
Hmbown:mainfrom
h3c-hexin:feat/max-output-tokens-env

Conversation

@h3c-hexin
Copy link
Copy Markdown
Contributor

@h3c-hexin h3c-hexin commented May 26, 2026

Problem

effective_max_output_tokens decides the max_tokens we send in each API request. For models not in the context_window_for_model table it falls back to a 128K window assumption and returns API_MAX_OUTPUT_TOKENS (64K).

That heuristic is fine for the hosted CodeWhale/DeepSeek API but immediately HTTP 400s on self-hosted providers with a tight --max-model-len. Concrete failure:

  • vLLM serving Qwen3.6 with --max-model-len 65536
  • Request input ≈ 1.5K tokens + the requested 64K output ⇒ provider rejects (total > 65536 by 1 token).

Operators today have no way to lower the requested output without forking the engine, because the cap is a code-internal constant rather than config.

Fix

A short escape hatch at the top of effective_max_output_tokens: if DEEPSEEK_MAX_OUTPUT_TOKENS is set to a positive integer, return it directly. Otherwise existing behavior is unchanged.

if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
    && let Ok(n) = raw.trim().parse::<u32>()
    && n > 0
{
    return n;
}

Typical use: DEEPSEEK_MAX_OUTPUT_TOKENS=16384 for a 64K-window vLLM deployment.

Scope

  • One function in crates/tui/src/core/engine/context.rs, +14/-0 of behavior.
  • No new config struct field. Env-only override keeps the public API unchanged, matching the existing DEEPSEEK_* knob convention (DEEPSEEK_CAPACITY_*, DEEPSEEK_BASE_URL, ...).
  • Composes with fix(engine): keep auto-compaction working on sub-500K self-hosted windows #2060: that PR fixed the input budget math for sub-500K windows; this one lets operators cap the output request to fit the provider's hard limit.
  • Does not touch the trust-boundary surface (no auth / sandbox / publishing / prompts).

Tests

Two new unit tests in crates/tui/src/core/engine/tests.rs:

  • effective_max_output_tokens_env_override_returns_positive_value — env=16384 applies regardless of model (V4 hosted, V4 flash, sub-500K self-hosted).
  • effective_max_output_tokens_env_override_rejects_zero_and_invalid — env=`0` / `abc` / `` / ` ` / `-1` all fall through to the heuristic (no silent zero cap on typo'd config).

Three pre-existing tests that internally call effective_max_output_tokens now acquire lock_test_env() so they stay isolated from the new env-mutating tests:

  • context_budget_reserves_output_and_headroom
  • effective_max_output_tokens_caps_api_request_for_large_window_models
  • internal_context_budget_tiers_reserved_output_by_window

```
cargo test -p codewhale-tui --bin codewhale-tui -- effective_max_output_tokens context_budget context_input_budget internal_context_budget
```

All 6 pass. Full `cargo test --workspace --all-features` is green except for one pre-existing failure (`prompts::tests::system_prompt_skips_locale_preamble_for_english`) that also reproduces on a clean `origin/main` checkout and is unrelated to this change.

🤖 Generated with Claude Code

Greptile Summary

This PR adds a runtime escape hatch — a new env var — that lets operators cap max_tokens in API requests when running against self-hosted providers (vLLM/SGLang) whose --max-model-len is tighter than the engine's default 64 K output assumption. Existing behavior is completely unchanged when the var is unset.

  • effective_max_output_tokens gains an early-return for positive-integer values of the env var, with the existing heuristic as fallback for zero, non-numeric, or unset values.
  • Three pre-existing tests are updated to hold the process-wide env lock for isolation; two new tests cover the override and the rejection of zero/invalid values.

Confidence Score: 4/5

Safe to merge with one fix: the env override is not clamped, so a misconfigured value above the model's context window can silently disable compaction for sub-500K deployments.

The override correctly falls through on zero/invalid input and leaves V4 models unaffected. The risk is that context_input_budget reuses effective_max_output_tokens for sub-500K models, so an env value larger than the window causes a checked_sub underflow, returning None and disabling compaction without any error. A one-line .min(API_MAX_OUTPUT_TOKENS) clamp would close this gap.

crates/tui/src/core/engine/context.rs — the new early-return in effective_max_output_tokens and its downstream effect on context_input_budget

Important Files Changed

Filename Overview
crates/tui/src/core/engine/context.rs Adds DEEPSEEK_MAX_OUTPUT_TOKENS env-var early-exit to effective_max_output_tokens; no upper-bound guard against values that exceed the model's context window
crates/tui/src/core/engine/tests.rs Adds ScopedDeepSeekMaxOutputTokens RAII guard with proper lock-serialised env cleanup; three existing tests acquire lock_test_env() for isolation; two new tests cover the override path

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["effective_max_output_tokens(model)"] --> B{"env var set and > 0?"}
    B -- Yes --> C["return env value (new path)"]
    B -- No --> D["context_window_for_model(model)"]
    D -- Some --> E{"window >= 500K?"}
    D -- None --> F["fallback: 128K window"]
    F --> E
    E -- Yes --> G["return API_MAX_OUTPUT_TOKENS (64K)"]
    E -- No --> H["return min(window/2, 64K)"]
    C --> I["API request: max_tokens = env value"]
    G --> I
    H --> I
    J["context_input_budget(model)"] --> K{"window >= 500K?"}
    K -- Yes --> L["reserved = TURN_MAX_OUTPUT_TOKENS (262K)"]
    K -- No --> M["reserved = effective_max_output_tokens(model) -- affected by env var"]
    L --> N["budget = window - reserved - 1024"]
    M --> N
    N --> O{"underflow?"}
    O -- Yes --> P["None: compaction disabled"]
    O -- No --> Q["Some(budget): compaction active"]
Loading

Fix All in Codex Fix All in Claude Code Fix All in Cursor

Reviews (1): Last reviewed commit: "feat(engine): allow DEEPSEEK_MAX_OUTPUT_..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

…-context providers

The `effective_max_output_tokens` heuristic defaults to 64K for any model
not in the known-context-window table. This is fine for DeepSeek's hosted
API (1M context) but causes immediate HTTP 400s on self-hosted providers
with tight `max-model-len`.

Example: vLLM serving Qwen3.6 with `--max-model-len 65536` rejects
requests because 64000 (output) + ~1500 (input) exceeds the limit by 1
token.

This change lets the operator set `DEEPSEEK_MAX_OUTPUT_TOKENS=16384` (or
whatever fits their deployment) to override the heuristic. The env var
takes precedence over the model-table lookup when set to a positive
integer; otherwise the existing behavior is preserved.

No new config struct field — env-only override keeps the public API
unchanged. Useful for embedded users (e.g. pinvou3) who need to control
output budget without forking the engine config schema.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an environment variable override DEEPSEEK_MAX_OUTPUT_TOKENS to allow manual configuration of the maximum output tokens, which is particularly useful for self-hosted providers. It also adds corresponding unit tests that safely manage the environment state. The reviewer identified a potential issue where setting this override to a value larger than the model's context window could cause an underflow in the input budget calculation, and suggested capping the override value to prevent this safety risk.

Comment on lines +40 to 46
if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
&& let Ok(n) = raw.trim().parse::<u32>()
&& n > 0
{
return n;
}
let window = context_window_for_model(model).unwrap_or(128_000);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If DEEPSEEK_MAX_OUTPUT_TOKENS is configured to a value that is too large relative to the model's context window, it will cause context_input_budget to underflow and return None. This silently disables all preflight and emergency recovery paths, which is a significant safety risk.

To prevent this, we should retrieve the model's context window first, and then cap the user-provided override to ensure we always leave enough room for the headroom (CONTEXT_HEADROOM_TOKENS) and at least 1 token for input.

    let window = context_window_for_model(model).unwrap_or(128_000);
    if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
        && let Ok(n) = raw.trim().parse::<u32>()
        && n > 0
    {
        let max_allowed = window.saturating_sub(CONTEXT_HEADROOM_TOKENS as u32 + 1);
        return n.min(max_allowed).max(1);
    }

Comment on lines +40 to +44
if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
&& let Ok(n) = raw.trim().parse::<u32>()
&& n > 0
{
return n;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 DEEPSEEK_* prefix misleads non-DeepSeek operators

The env var controls max_tokens for any model routed through the engine — a Qwen3 or Llama operator using vLLM would not intuitively reach for DEEPSEEK_MAX_OUTPUT_TOKENS to solve their tight-context problem. The DEEPSEEK_* convention makes sense for vendor-specific keys (DEEPSEEK_BASE_URL, DEEPSEEK_CAPACITY_*), but this knob is provider-agnostic. Was a more generic name like CODEWHALE_MAX_OUTPUT_TOKENS considered?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Codex Fix in Claude Code Fix in Cursor

Comment on lines +40 to +45
if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
&& let Ok(n) = raw.trim().parse::<u32>()
&& n > 0
{
return n;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing upper-bound guard can silently disable compaction

context_input_budget calls effective_max_output_tokens as the reserved_output term for sub-500K models. If the env override is set above window − CONTEXT_HEADROOM_TOKENS (e.g. 70000 on a 65 536-token vLLM deployment), checked_sub underflows to None, silently disabling all compaction and preflight checks for that session. Adding .min(API_MAX_OUTPUT_TOKENS) before returning n caps the override at the same ceiling already used by the heuristic path.

Fix in Codex Fix in Claude Code Fix in Cursor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant