feat(model+mem): retry with backoff for transient errors + fix in-memory map leaks #80

Closed
dyyz1993 wants to merge 3 commits into KunAgent:master from dyyz1993:fix/model-retry-and-memory-leaks

dyyz1993 commented Jun 8, 2026

Summary

Two independent improvements bundled in one PR (separate commits for easy cherry-picking).

Commit 1: feat(model): add retry with backoff for 429/5xx and network errors

The DeepSeek-compatible model client previously failed the entire turn on the first transient error (rate limit, server hiccup, network blip).

Changes (src/adapters/model/deepseek-compat-model-client.ts):

  • Exponential backoff with 25% jitter for HTTP 429 and 5xx responses
  • Honors the Retry-After header (both seconds and HTTP-date formats)
  • Retries network errors (fetch throws) before any chunk is yielded
  • Aborts cleanly if the turn is cancelled mid-backoff
  • Configurable via DeepseekCompatConfig.retry (maxAttempts, baseDelayMs, maxDelayMs)
  • Safety guarantee: retries only happen before any streaming chunk is yielded, so consumers never see partial-then-retried output
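
The delay policy in the list above can be sketched as a few pure functions. This is a minimal illustration, not the code from this PR: the names `RetryOptions`, `isRetryableStatus`, `parseRetryAfterMs`, and `backoffDelayMs` are hypothetical, the jitter is assumed to be symmetric (plus or minus 25%), and clamping a `Retry-After` value to `maxDelayMs` is also an assumption.

```typescript
// Hypothetical shape mirroring the retry options described in this PR.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Defaults taken from the PR description.
const DEFAULT_RETRY: RetryOptions = { maxAttempts: 3, baseDelayMs: 500, maxDelayMs: 8000 };

// Only HTTP 429 and 5xx responses are treated as transient.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Retry-After is either delta-seconds ("120") or an HTTP-date.
// Returns the wait in milliseconds, or undefined if absent or unparseable.
function parseRetryAfterMs(header: string | null, nowMs: number): number | undefined {
  if (header === null) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs);
}

// `attempt` is 1-based: the delay to wait after attempt N has failed.
// `random` is injectable so the jitter can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  opts: RetryOptions,
  retryAfterMs?: number,
  random: () => number = Math.random,
): number {
  // Assumption: a server-provided hint wins, but is capped at maxDelayMs.
  if (retryAfterMs !== undefined) return Math.min(retryAfterMs, opts.maxDelayMs);
  const exponential = Math.min(opts.baseDelayMs * 2 ** (attempt - 1), opts.maxDelayMs);
  // Assumption: symmetric jitter, i.e. a factor in [0.75, 1.25).
  const jitter = 1 + (random() * 2 - 1) * 0.25;
  return Math.min(Math.round(exponential * jitter), opts.maxDelayMs);
}
```

With the defaults and no jitter, the delays after attempts 1, 2, and 3 would be 500 ms, 1000 ms, and 2000 ms under these assumptions.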

New config option:

retry?: {
  maxAttempts?: number   // default 3
  baseDelayMs?: number   // default 500
  maxDelayMs?: number    // default 8000
}

Commit 2: fix(mem): bound in-memory maps with TtlLruCache, fix approval gate leak, clean telemetry on thread delete

Four memory-safety fixes for long-running kun serve instances:

  1. toolCatalogSnapshots (agent-loop.ts): Map → TtlLruCache(limit=64, ttl=30min). The old Map was written to on every modelStep but never had entries deleted.

  2. promptTokenPressure (agent-loop.ts): Map → TtlLruCache(limit=128, ttl=10min). The only delete point was in compactIfNeeded; failed/aborted turns would leak entries.

  3. InMemoryApprovalGate (in-memory-approval-gate.ts):

    • decide() now deletes the entry instead of overwriting with a resolved value (was leaking resolved approvals forever)
    • Pending approvals auto-expire after 10 min (rejecting the promise and cleaning up)
    • New drainAllPending() method for startup/shutdown cleanup
  4. Thread deletion (routes/index.ts): DELETE /v1/threads/:id now calls usageService.reset() to clean up per-thread telemetry (usage counter + cache telemetry).
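
The approval-gate behavior in item 3 can be sketched roughly as follows. This is a simplified illustration rather than the actual InMemoryApprovalGate: the class name `ApprovalGateSketch`, the `Decision` type, and the single-map layout are assumptions, while `decide`, `drainAllPending`, and `cleanupEntry` borrow names that appear in this PR.

```typescript
type Decision = "approved" | "denied";

interface PendingApproval {
  resolve: (decision: Decision) => void;
  reject: (error: Error) => void;
  timer: ReturnType<typeof setTimeout>;
}

// Hypothetical stand-in for the gate; only pending approvals are stored.
class ApprovalGateSketch {
  private readonly pending = new Map<string, PendingApproval>();

  // Default TTL of 10 minutes, matching the PR description.
  constructor(private readonly ttlMs: number = 10 * 60 * 1000) {}

  request(id: string): Promise<Decision> {
    if (this.pending.has(id)) {
      return Promise.reject(new Error(`approval ${id} is already pending`));
    }
    return new Promise<Decision>((resolve, reject) => {
      const timer = setTimeout(() => this.expire(id), this.ttlMs);
      this.pending.set(id, { resolve, reject, timer });
    });
  }

  // Deletes the entry before resolving, so decided approvals never linger.
  decide(id: string, decision: Decision): boolean {
    const entry = this.cleanupEntry(id);
    if (!entry) return false;
    entry.resolve(decision);
    return true;
  }

  // Rejects and removes everything still pending; returns how many were drained.
  drainAllPending(reason: string = "approval gate drained"): number {
    const ids = Array.from(this.pending.keys());
    for (const id of ids) {
      this.cleanupEntry(id)?.reject(new Error(reason));
    }
    return ids.length;
  }

  get size(): number {
    return this.pending.size;
  }

  private expire(id: string): void {
    this.cleanupEntry(id)?.reject(new Error(`approval ${id} expired`));
  }

  // Single place where the timer and the map entry are released together.
  private cleanupEntry(id: string): PendingApproval | undefined {
    const entry = this.pending.get(id);
    if (!entry) return undefined;
    clearTimeout(entry.timer);
    this.pending.delete(id);
    return entry;
  }
}
```

Grabbing the entry inside `cleanupEntry` and resolving or rejecting it afterwards keeps all state removal in one method, so a callback that throws cannot leave a stale entry or timer behind.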

Both TtlLruCache replacements are API-compatible with Map (verified all 6 call sites). TTL expiry degrades gracefully — a miss simply means "rebuild the snapshot" or "skip compaction this round", never data corruption.
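
For readers unfamiliar with the pattern, a bounded TTL + LRU cache with a Map-like surface might look like the sketch below. It is not the TtlLruCache from src/cache/ (whose actual API is not shown in this PR); the class name `TtlLruCacheSketch`, the constructor signature, and the injectable clock are all assumptions made for illustration.

```typescript
// Hypothetical bounded cache: entries expire after ttlMs, and the least
// recently used entry is evicted once `limit` is exceeded.
class TtlLruCacheSketch<K, V> {
  // A Map iterates in insertion order, so the first key is the oldest.
  private readonly entries = new Map<K, { value: V; expiresAt: number }>();

  constructor(
    private readonly limit: number,
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  get(key: K): V | undefined {
    const entry = this.live(key);
    if (!entry) return undefined;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): this {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.limit) {
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    return this;
  }

  has(key: K): boolean {
    return this.live(key) !== undefined;
  }

  delete(key: K): boolean {
    return this.entries.delete(key);
  }

  // Counts stored entries; expired ones are only dropped lazily on access.
  get size(): number {
    return this.entries.size;
  }

  // Returns the entry if present and unexpired, dropping it otherwise.
  private live(key: K): { value: V; expiresAt: number } | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry;
  }
}
```

Because `get`, `set`, `has`, and `delete` keep Map's signatures, call sites that only use those methods need no changes, and a miss after expiry looks the same as a key that was never set.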

Testing

  • 173 tests pass (10 test files: model-client, ports, token-economy, cache, usage-service, atomic-write, contracts, domain, output-accumulator, read-json-body, http-server)
  • 8 new tests added:
    • Commit 1 (5 in model-client.test.ts): 429 retry success, Retry-After header honored, non-retryable errors (400), network error retry, abort during backoff
    • Commit 2 (3 in ports.test.ts): post-decide entry cleanup, TTL auto-expiry rejects promise, drainAllPending clears all
  • Zero new TypeScript errors (verified via tsc --noEmit)
  • Zero new dependencies — reuses existing TtlLruCache from src/cache/
  • Zero breaking API changes — all new config options are optional with sensible defaults

Files changed

 src/adapters/in-memory-approval-gate.ts            |  52 +++-
 src/adapters/model/deepseek-compat-model-client.ts | 121 +++++++-
 src/loop/agent-loop.ts                             |  11 +-
 src/server/routes/index.ts                         |   6 +-
 tests/model-client.test.ts                         | 158 ++++++++++
 tests/ports.test.ts                                |  56 +++-
 6 files changed, 382 insertions(+), 22 deletions(-)

xuyingzhou added 3 commits June 8, 2026 12:21
The DeepSeek-compatible model client previously failed the entire turn
on the first transient error (rate limit, server hiccup, network blip).
This adds exponential backoff with jitter, honors the Retry-After header,
and aborts cleanly if the turn is cancelled mid-backoff. Retries only
happen before any streaming chunk is yielded, so consumers never see
partial-then-retried output.

Adds 5 tests covering: 429 retry success, Retry-After header, non-retryable
errors, network error retry, and abort during backoff.
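
The "retry only before the first chunk" rule and the abort-aware backoff described in this commit message could be structured along these lines. This is a sketch under assumptions, not the client code from this PR: `sleep`, `streamWithRetry`, and the `StreamRetryOptions` fields are hypothetical names, and the stream source is modeled as a plain function returning an async iterable.

```typescript
// Resolves after `ms`, or rejects as soon as the signal is aborted.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted during backoff"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted during backoff"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Hypothetical options bag for the sketch.
interface StreamRetryOptions {
  maxAttempts: number;
  delayMs: (attempt: number) => number;
  isRetryable: (error: unknown) => boolean;
  signal?: AbortSignal;
}

// Re-opens the stream on transient failures, but only while nothing has
// been yielded yet, so consumers never see partial-then-retried output.
async function* streamWithRetry<T>(
  open: () => AsyncIterable<T>,
  opts: StreamRetryOptions,
): AsyncGenerator<T> {
  for (let attempt = 1; ; attempt++) {
    let yielded = false;
    try {
      for await (const chunk of open()) {
        yielded = true;
        yield chunk;
      }
      return;
    } catch (error) {
      const canRetry = !yielded && attempt < opts.maxAttempts && opts.isRetryable(error);
      if (!canRetry) throw error;
      await sleep(opts.delayMs(attempt), opts.signal);
    }
  }
}
```

Tracking `yielded` per attempt is what enforces the guarantee: once a chunk has reached the consumer, any later error is rethrown instead of restarting the stream from the beginning.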
…thod

Move resolvers.delete() into cleanupEntry so all state (approvals,
timers, resolvers) is cleaned in one place. Adjust decide and expire
to grab the resolver reference before calling cleanupEntry.

@XingYu-Zhong (Collaborator)

This PR currently targets the master branch. Please change the base branch to develop before continuing with review/merge. Thanks.
