feat(model+mem): retry with backoff for transient errors + fix in-memory map leaks by dyyz1993 · Pull Request #80 · KunAgent/Kun

dyyz1993 · 2026-06-08T04:25:37Z

Summary

Two independent improvements bundled in one PR (separate commits for easy cherry-picking).

Commit 1: `feat(model): add retry with backoff for 429/5xx and network errors`

The DeepSeek-compatible model client previously failed the entire turn on the first transient error (rate limit, server hiccup, network blip).

Changes (src/adapters/model/deepseek-compat-model-client.ts):

- Exponential backoff with 25% jitter for HTTP 429 and 5xx responses
- Honors the Retry-After header (both seconds and HTTP-date formats)
- Retries network errors (fetch throws) before any chunk is yielded
- Aborts cleanly if the turn is cancelled mid-backoff
- Configurable via DeepseekCompatConfig.retry (maxAttempts, baseDelayMs, maxDelayMs)
- Safety guarantee: retries only happen before any streaming chunk is yielded, so consumers never see partial-then-retried output
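
For reviewers who want the control flow spelled out, the following is a minimal sketch of the "retry only before the first chunk" rule and the abortable backoff wait. It is not the code in this PR: `RetryableError`, `abortableSleep`, and `streamWithRetry` are hypothetical names, and treating `maxAttempts` as the total number of attempts (not the number of retries) is an assumption.

```typescript
// Hypothetical marker for errors worth retrying (429, 5xx, network failure).
class RetryableError extends Error {}

/** Sleep that rejects as soon as the signal aborts, so a cancelled turn does not wait out the backoff. */
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      clearTimeout(timer);
      reject(new Error("aborted"));
    }
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

/** Re-opens the stream on retryable errors, but only while nothing has been yielded yet. */
async function* streamWithRetry<T>(
  open: () => AsyncIterable<T>,
  opts: { maxAttempts: number; delayMs: (attempt: number) => number; signal?: AbortSignal },
): AsyncGenerator<T> {
  for (let attempt = 1; ; attempt++) {
    let yielded = false;
    try {
      for await (const chunk of open()) {
        yielded = true;
        yield chunk;
      }
      return;
    } catch (err) {
      const retryable = err instanceof RetryableError;
      // Once a chunk has reached the consumer, retrying would duplicate output, so rethrow.
      if (yielded || !retryable || attempt >= opts.maxAttempts) throw err;
      await abortableSleep(opts.delayMs(attempt), opts.signal);
    }
  }
}
```

In this sketch a mid-stream failure surfaces to the caller unchanged, which is what keeps consumers from seeing partial-then-retried output.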

New config option:

retry?: {
  maxAttempts?: number   // default 3
  baseDelayMs?: number   // default 500
  maxDelayMs?: number    // default 8000
}
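
To make the defaults concrete, here is a rough sketch of how a delay could be derived from these options. Only the option names and default values come from this PR. The helper names `backoffDelayMs` and `parseRetryAfterMs`, the doubling schedule, and the symmetric (plus or minus 25%) jitter are assumptions; the actual client may apply jitter differently.

```typescript
interface RetryOptions {
  maxAttempts?: number; // default 3
  baseDelayMs?: number; // default 500
  maxDelayMs?: number; // default 8000
}

/** Delay before retry number `attempt` (1-based): base * 2^(attempt-1), jittered by +/-25%, capped. */
function backoffDelayMs(
  attempt: number,
  opts: RetryOptions = {},
  random: () => number = Math.random,
): number {
  const base = opts.baseDelayMs ?? 500;
  const cap = opts.maxDelayMs ?? 8000;
  const exponential = Math.min(cap, base * 2 ** (attempt - 1));
  const jitter = 1 + (random() * 2 - 1) * 0.25; // factor in [0.75, 1.25)
  return Math.min(cap, Math.round(exponential * jitter));
}

/** Parses Retry-After as delta-seconds or an HTTP-date; returns milliseconds, or undefined if unusable. */
function parseRetryAfterMs(header: string | null, now: number = Date.now()): number | undefined {
  if (!header) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  return Number.isNaN(date) ? undefined : Math.max(0, date - now);
}
```

With the defaults and no jitter, the schedule would be 500 ms, 1000 ms, 2000 ms, and so on up to the 8000 ms cap. Whether a server-supplied Retry-After value is also capped by maxDelayMs is not stated in this description, so the sketch leaves that decision to the caller.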

Commit 2: `fix(mem): bound in-memory maps with TtlLruCache, fix approval gate leak, clean telemetry on thread delete`

Four memory-safety fixes for long-running kun serve instances:

- toolCatalogSnapshots (agent-loop.ts): Map → TtlLruCache(limit=64, ttl=30min). The old Map was written to on every modelStep but never had entries deleted.
- promptTokenPressure (agent-loop.ts): Map → TtlLruCache(limit=128, ttl=10min). The only delete point was in compactIfNeeded; failed/aborted turns would leak entries.
- InMemoryApprovalGate (in-memory-approval-gate.ts):
  - decide() now deletes the entry instead of overwriting with a resolved value (was leaking resolved approvals forever)
  - Pending approvals auto-expire after 10 min (rejecting the promise and cleaning up)
  - New drainAllPending() method for startup/shutdown cleanup
- Thread deletion (routes/index.ts): DELETE /v1/threads/:id now calls usageService.reset() to clean up per-thread telemetry (usage counter + cache telemetry).
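
The approval-gate behaviour described above can be illustrated with a simplified sketch. This is not the actual InMemoryApprovalGate: the class name `ApprovalGateSketch`, the `Decision` type, the `request` method, and the choice to reject promises in `drainAllPending` are all assumptions. Only `decide`, `drainAllPending`, `cleanupEntry`, and the 10-minute expiry are named in this PR.

```typescript
type Decision = "approved" | "denied";

interface PendingEntry {
  resolve: (decision: Decision) => void;
  reject: (error: Error) => void;
  timer: ReturnType<typeof setTimeout>;
}

class ApprovalGateSketch {
  private readonly pending = new Map<string, PendingEntry>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 10 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  /** Registers a pending approval that rejects on its own once the TTL elapses. */
  request(id: string): Promise<Decision> {
    this.cleanupEntry(id)?.reject(new Error(`approval ${id} superseded`));
    return new Promise<Decision>((resolve, reject) => {
      const timer = setTimeout(() => {
        this.cleanupEntry(id)?.reject(new Error(`approval ${id} expired`));
      }, this.ttlMs);
      this.pending.set(id, { resolve, reject, timer });
    });
  }

  /** Resolves and removes the entry, so decided approvals do not accumulate. */
  decide(id: string, decision: Decision): boolean {
    const entry = this.cleanupEntry(id);
    entry?.resolve(decision);
    return entry !== undefined;
  }

  /** Rejects and removes every pending approval; returns how many were drained. */
  drainAllPending(reason: string = "approval gate drained"): number {
    const ids = [...this.pending.keys()];
    for (const id of ids) this.cleanupEntry(id)?.reject(new Error(reason));
    return ids.length;
  }

  get size(): number {
    return this.pending.size;
  }

  /** Single cleanup path: clears the timer, drops the entry, and hands it back to be settled. */
  private cleanupEntry(id: string): PendingEntry | undefined {
    const entry = this.pending.get(id);
    if (!entry) return undefined;
    clearTimeout(entry.timer);
    this.pending.delete(id);
    return entry;
  }
}
```

Routing decide, expiry, and drain through one cleanup helper means no path can settle a promise while leaving its timer or map entry behind.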

Both TtlLruCache replacements are API-compatible with Map (verified all 6 call sites). TTL expiry degrades gracefully — a miss simply means "rebuild the snapshot" or "skip compaction this round", never data corruption.
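
As an illustration of why a bounded cache can stand in for a Map here, the sketch below implements a small TTL + LRU structure with Map-like get/set/has/delete. It is not the TtlLruCache in src/cache/, whose constructor and method signatures are not shown in this PR; the class name `TtlLruSketch` and the injectable clock are assumptions made for the example.

```typescript
class TtlLruSketch<K, V> {
  // Map preserves insertion order, so the first key is always the least recently used.
  private readonly entries = new Map<K, { value: V; expiresAt: number }>();
  private readonly limit: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(limit: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: behaves like a plain miss
      return undefined;
    }
    // Re-insert to mark the key as most recently used (the TTL is not extended).
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: K, value: V): this {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.limit) {
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    return this;
  }

  has(key: K): boolean {
    return this.get(key) !== undefined;
  }

  delete(key: K): boolean {
    return this.entries.delete(key);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A caller that already handles `undefined` from `Map.get` needs no changes: an evicted or expired key looks the same as a key that was never set, which is the graceful-degradation property described above.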

Testing

- 173 tests pass (10 test files: model-client, ports, token-economy, cache, usage-service, atomic-write, contracts, domain, output-accumulator, read-json-body, http-server)
- 8 new tests added:
  - Commit 1 (5 in model-client.test.ts): 429 retry success, Retry-After header honored, non-retryable errors (400), network error retry, abort during backoff
  - Commit 2 (3 in ports.test.ts): post-decide entry cleanup, TTL auto-expiry rejects promise, drainAllPending clears all
- Zero new TypeScript errors (verified via tsc --noEmit)
- Zero new dependencies: reuses the existing TtlLruCache from src/cache/
- Zero breaking API changes: all new config options are optional with sensible defaults

Files changed

 src/adapters/in-memory-approval-gate.ts        |  52 +++-
 src/adapters/model/deepseek-compat-model-client.ts | 121 +++++++-
 src/loop/agent-loop.ts                         |  11 +-
 src/server/routes/index.ts                     |   6 +-
 tests/model-client.test.ts                     | 158 ++++++++++
 tests/ports.test.ts                            |  56 +++-
 6 files changed, 382 insertions(+), 22 deletions(-)

The DeepSeek-compatible model client previously failed the entire turn on the first transient error (rate limit, server hiccup, network blip). This adds exponential backoff with jitter, honors the Retry-After header, and aborts cleanly if the turn is cancelled mid-backoff. Retries only happen before any streaming chunk is yielded, so consumers never see partial-then-retried output. Adds 5 tests covering: 429 retry success, Retry-After header, non-retryable errors, network error retry, and abort during backoff.


XingYu-Zhong · 2026-06-08T13:46:32Z

This PR currently targets the master branch. Please change the base branch to develop before continuing with review/merge. Thanks.

xuyingzhou added 3 commits June 8, 2026 12:21

fix(mem): bound in-memory maps with TtlLruCache, fix approval gate leak, clean telemetry on thread delete

40683ea

refactor(approval-gate): consolidate state cleanup in cleanupEntry method

047dda6

Move resolvers.delete() into cleanupEntry so all state (approvals, timers, resolvers) is cleaned in one place. Adjust decide and expire to grab the resolver reference before calling cleanupEntry.

XingYu-Zhong closed this Jun 9, 2026

dyyz1993 wants to merge 3 commits into KunAgent:master from dyyz1993:fix/model-retry-and-memory-leaks
