feat(kb-open): Deep Research open API (start/SSE/status/cancel) #446

Open
ncw1992120 wants to merge 4 commits into mateaix:dev from ncw1992120:feat/kb-open-research

@ncw1992120

Contributor

Closes #443 · Part of #440 · Builds on #441 (P0-A)

Changes

Async Deep Research open API. Research is a multi-step LLM pipeline (plan → retrieve+draft → compose); it runs asynchronously and pushes progress over SSE.

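4 endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /{kbId}/research | Start (returns sessionId + streamUrl) |
| GET | /{kbId}/research/{id}/stream | SSE progress stream (?token= for EventSource) |
| GET | /{kbId}/research/{id}/status | Query status / final report |
| POST | /{kbId}/research/{id}/cancel | Cancel a running session |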
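
A minimal client-side sketch of the start call and the stream URL. It assumes the endpoints sit under `/api/v1/open/kb` (the path the auth filter guards), and it invents three things this PR does not confirm: the host `kb.example.com`, a `Bearer` scheme for the Authorization header, and a JSON body with a `question` field.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Builds (but does not send) requests for the research endpoints. */
final class ResearchRequests {
    // Assumption: placeholder host; the path prefix is the one the auth filter guards.
    static final String BASE = "https://kb.example.com/api/v1/open/kb";

    /** POST /{kbId}/research. The "question" field name is hypothetical. */
    static HttpRequest start(String kbId, String apiKey, String question) {
        String escaped = question.replace("\\", "\\\\").replace("\"", "\\\"");
        String body = "{\"question\":\"" + escaped + "\"}";
        return HttpRequest.newBuilder(URI.create(BASE + "/" + kbId + "/research"))
                .header("Authorization", "Bearer " + apiKey) // assumed scheme
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    /** GET /{kbId}/research/{id}/stream?token=... for clients that cannot set headers. */
    static URI streamUri(String kbId, String sessionId, String apiKey) {
        String token = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
        return URI.create(BASE + "/" + kbId + "/research/" + sessionId + "/stream?token=" + token);
    }
}
```

A caller would send the `start` request with `java.net.http.HttpClient`, read `sessionId` from the response, and open the stream URI. Note that a key placed in the query string can end up in access or proxy logs.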

Components

  • KbOpenResearchController: 4 endpoints, @RequireKbScope("kb:search")
  • KbResearchSessionRegistry: in-memory session tracking that records keyId ownership (a caller can only query/cancel its own sessions)

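Security

  • R7 (SSE auth): ?token= query param (KbOpenApiAuthFilter already supports this fallback, since EventSource cannot set an Authorization header)
  • Session ownership: status/cancel/stream all verify that keyId matches
  • Cancel precondition: the session must be RUNNING (409 otherwise)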
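
The ownership and cancel checks can be sketched roughly as follows. This is not the PR's controller code: the class and record names are hypothetical, and answering 404 for a session owned by another key is an assumption (the PR only states that keyId must match and that a non-RUNNING session yields 409).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CancelCheck {
    enum Status { RUNNING, COMPLETED, FAILED, CANCELLED }
    record Session(String id, String keyId, Status status) {}

    private final Map<String, Session> sessions = new ConcurrentHashMap<>();

    void put(Session s) { sessions.put(s.id(), s); }

    /** Returns the HTTP status a cancel request would receive. */
    int cancel(String sessionId, String callerKeyId) {
        Session s = sessions.get(sessionId);
        // Assumption: a foreign session looks like a missing one (404),
        // so a caller cannot probe for other keys' session ids.
        if (s == null || !s.keyId().equals(callerKeyId)) return 404;
        if (s.status() != Status.RUNNING) return 409; // per the PR: must be RUNNING
        Session cancelled = new Session(s.id(), s.keyId(), Status.CANCELLED);
        // replace() only succeeds if nobody changed the session in the meantime.
        return sessions.replace(sessionId, s, cancelled) ? 200 : 409;
    }
}
```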

Reuse

The underlying work fully reuses the existing WikiResearchService.research() + ChatStreamTracker (SSE broadcast); the research pipeline itself is unchanged.

Tests

Tests run: 23, Failures: 0, Errors: 0 (P0-A 17 + research 6)
BUILD SUCCESS
  • KbResearchSessionRegistryTest (6): register/complete/fail/cancel lifecycle, cancel-on-completed no-op, unknown session returns empty

Dependencies

This PR includes a cherry-pick of P0-A. If #444 (P0-A) merges first, only the 3 research files remain after a rebase.

…authz

Implements the authentication backbone for the KB Open API (mateaix#441):
API key lifecycle, a permitAll-path filter that rejects (never
pass-through), per-key sliding-window rate limiting, and a centralized
@RequireKbScope interceptor for scope + KB-ownership checks.

Components:
- TokenHashUtil: shared SHA-256 hash kernel (A4), reusable by PAT later
- KbApiKeyService: mint/authenticate/revoke/update + multi-KB binding
  (R3: empty binding = zero access, not "all KBs")
- KbOpenApiAuthFilter: sole gatekeeper for /api/v1/open/kb/** (R1: must
  return 401, no pass-through); R2: per-key rate limit (429)
- KbApiKeyRateLimiter: sliding-window limiter (TriggerRateLimiter pattern)
- @RequireKbScope + KbScopeInterceptor: centralized authorization (A1),
  scope check + kbId ownership from path variable
- KbApiKeyAdminController: JWT-authenticated CRUD (list/create/detail/
  update/revoke), workspace-scoped
- V162 migration (h2/mysql/kingbase): mate_kb_api_key + _binding tables

Security:
- mck_ prefix (distinct from PAT mc_ and JWT eyJ)
- SHA-256 hash storage, plaintext shown once at creation
- prefix column (4 chars) for UI display only
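
A minimal sketch of the hash-only storage idea, assuming a hypothetical key shape of `mck_` plus 32 random hex characters. The real TokenHashUtil and KbApiKeyService are not shown in this PR, so the class and method names here are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.HexFormat;

final class KeyMintSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Hypothetical key shape: "mck_" followed by 32 random hex characters. */
    static String mint() {
        byte[] raw = new byte[16];
        RANDOM.nextBytes(raw);
        return "mck_" + HexFormat.of().formatHex(raw);
    }

    /** SHA-256 of the plaintext key, hex-encoded; only this value would be stored. */
    static String hash(String plaintext) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(plaintext.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every JVM
        }
    }
}
```

Authentication would then hash the presented key and look the hash up, so the plaintext never needs to be stored.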

Tests (17 new, all green):
- KbApiKeyServiceTest: R3 empty-binding rejection, auth round-trip,
  expired/disabled/wrong-prefix rejection, kb:* wildcard, revoke
- KbApiKeyRateLimiterTest: sliding window, per-key isolation, recovery

Closes mateaix#441
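
For readers unfamiliar with per-key sliding-window limiting, here is a generic sketch of the technique. It is not KbApiKeyRateLimiter (whose TriggerRateLimiter pattern is not shown here); the class name, the timestamp-log approach, and the explicit `nowMillis` parameter are assumptions made to keep the example testable.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class SlidingWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, Deque<Long>> hits = new ConcurrentHashMap<>();

    SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the request is allowed; a caller would answer 429 otherwise. */
    boolean tryAcquire(String keyId, long nowMillis) {
        Deque<Long> q = hits.computeIfAbsent(keyId, k -> new ArrayDeque<>());
        synchronized (q) {
            // Drop timestamps that have slid out of the window.
            while (!q.isEmpty() && nowMillis - q.peekFirst() >= windowMillis) {
                q.pollFirst();
            }
            if (q.size() >= limit) return false;
            q.addLast(nowMillis);
            return true;
        }
    }
}
```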
@mateaix

mateaix commented Jun 28, 2026

Owner

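Thanks for the Deep Research open API 🙏 The auth side is well done: the path goes through permitAll + KbOpenApiAuthFilter as a single fail-closed gate, all four endpoints carry @RequireKbScope("kb:search"), and the jobId IDOR is closed off: requireSessionOwnership checks session.keyId().equals(ctx.keyId()), so A can neither see nor cancel B's session. The SSE pipeline (Utf8SseEmitter + 10 min timeout + detach on onCompletion/onTimeout/onError) is also correct.

But the async job layer has several blockers, which matter a lot for an endpoint that is public and incurs real cost: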

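1. Cancel does not actually stop the job.
WikiResearchService.research() is a straight-through synchronous pipeline (plan → parallel draft → compose) that never checks streamTracker.isStopRequested(...) and has no interrupt flag. The cancel endpoint only flips the registry status to CANCELLED and broadcasts an SSE close, while the background virtual thread still runs the LLM/web calls to the end, so "cancel" saves neither cost nor compute. This needs a cooperative cancellation signal (check a stop flag between research stages, or hold a Future and interrupt it).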

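2. CANCELLED gets overwritten by COMPLETED/FAILED.
When the job finishes, sessionRegistry.complete(...) (computeIfPresent) rewrites the status unconditionally, so a user who cancels and then queries /status sees completed and the full report. Please make CANCELLED a "sticky" terminal state, with complete/fail becoming no-ops once the session is CANCELLED.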

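3. The session registry grows without bound (memory leak).
KbResearchSessionRegistry.sessions is never cleaned up: no TTL, no scheduled sweep, no removal on completion, so every session lives until the JVM exits. Please add a TTL/scheduled sweep or a capacity limit.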

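4. No per-key cap on concurrently running jobs (cost/DoS).
The filter only limits requests per minute (default 60/min, with start counted too), not the number of research jobs running at once. One key can launch roughly 60 multi-step LLM pipelines per minute, each fanning out into sub-questions, all on an unbounded newVirtualThreadPerTaskExecutor. This is the main cost-explosion/DoS path for a public endpoint. I suggest a per-key cap on running jobs, returning 429 when exceeded. The design doc §9/§10 itself says research needs "rate limiting + billing", and token billing (TokenUsageService) is also still missing here.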

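5. Inline fully qualified names: SecurityConfig.java, WebMvcConfig.java, KbOpenResearchController.java (new java.util.LinkedHashMap<>()), and java.util.List.of(...) in the tests; per the conventions, change these to an import + simple name.

Non-blocking: the V162 migration header comment says V161 (and the number collides with #437; renumber along with P0-A); research reuses the kb:search scope but the design doc does not assign it a scope, so please add a line explaining this; kb-open-api-design.md should likewise move out of the repo root.

Once P0-A at the bottom of the stack is fixed, rebase this PR and add the job-lifecycle/cost controls from items 1–4 above, and then we will merge 🙏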
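
The cooperative signal asked for in item 1 could look roughly like the sketch below. `StopFlags` is a hypothetical stand-in for the stop-flag part of ChatStreamTracker, the stage bodies are placeholders, and the `afterPlan` hook exists only so the example can simulate a cancel arriving mid-run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class CooperativeResearch {
    /** Hypothetical stand-in for the tracker's stop flags. */
    static final class StopFlags {
        private final Set<String> stopped = ConcurrentHashMap.newKeySet();
        void requestStop(String sessionId) { stopped.add(sessionId); }
        boolean isStopRequested(String sessionId) { return stopped.contains(sessionId); }
    }

    static final class CancelledException extends RuntimeException {}

    private final StopFlags flags;
    final List<String> executed = new ArrayList<>(); // records which stages ran

    CooperativeResearch(StopFlags flags) { this.flags = flags; }

    private void throwIfStopped(String sessionId) {
        if (flags.isStopRequested(sessionId)) throw new CancelledException();
    }

    /** Runs plan -> draft -> compose, checking the flag before each later stage. */
    boolean run(String sessionId, Runnable afterPlan) {
        try {
            executed.add("plan");
            afterPlan.run();            // hook: lets a caller cancel mid-run
            throwIfStopped(sessionId);
            executed.add("draft");
            throwIfStopped(sessionId);
            executed.add("compose");
            return true;
        } catch (CancelledException e) {
            return false;               // cancelled on request, not an error
        }
    }
}
```

Checking only at stage boundaries means a stage already in flight still finishes; interrupting a held Future would be the more aggressive alternative the item mentions.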

BLOCKERS:
- prefix column VARCHAR(6) → VARCHAR(12) across all 3 migration dialects;
  KbApiKeyService.create() produces 8 chars (mck_ + 4 random), VARCHAR(6)
  would silently truncate on H2 and throw on MySQL strict mode
- Rename migration V162 → V164 to avoid collision with merged mateaix#437
  (V162=wiki_raw_material_error_code, V163=wiki_raw_material_warning)
  and fix stale V161 references in h2/kingbase comments

NITS:
- SecurityConfig/WebMvcConfig: replace inline FQN with import + simple name
- parseScopes: add .map(String::trim) so ' kb:read' matches correctly
- Remove ?token= SSE query fallback in KbOpenApiAuthFilter (P0-A has no
  SSE endpoint; key would leak into access/proxy logs — R5)
- Move kb-open-api-design.md from repo root to rfcs/ (contains RFC-090
  internal reference that would be exposed by sync-opensource)
- KbApiKeyEntity Javadoc: 'first 4 chars' → 'first 8 chars (mck_ + 4)'
  to match actual behavior
Implements the async Deep Research endpoint for the KB Open API (mateaix#443).
Research is a multi-step LLM pipeline (plan → retrieve+draft → compose)
that runs asynchronously and broadcasts progress via SSE.

Endpoints:
- POST /{kbId}/research                      start (returns sessionId + streamUrl)
- GET  /{kbId}/research/{id}/stream          SSE progress (?token= for EventSource)
- GET  /{kbId}/research/{id}/status          query status / final report
- POST /{kbId}/research/{id}/cancel          cancel running session

Components:
- KbOpenResearchController: 4 endpoints, @RequireKbScope("kb:search")
- KbResearchSessionRegistry: in-memory session tracking with keyId
  ownership (a caller can only query/cancel their own sessions)

Security:
- R7: SSE uses ?token= query param (KbOpenApiAuthFilter already supports
  this fallback for EventSource which can't set Authorization headers)
- Session ownership: status/cancel/stream all verify keyId match
- Cancel checks session is RUNNING (409 otherwise)

Reuses existing WikiResearchService.research() + ChatStreamTracker for
the actual research pipeline and SSE broadcasting.

Tests (6 new, all green):
- KbResearchSessionRegistryTest: register/complete/fail/cancel lifecycle,
  cancel-on-completed no-op, unknown session returns empty

Closes mateaix#443
…urrency cap

Review mateaix#446 — address all 4 job-lifecycle/cost blockers + nits:

1. Cooperative cancellation (was: cancel only flipped status, pipeline ran
   to completion). Cancel endpoint now calls streamTracker.requestStop();
   WikiResearchService.ensureNotCancelled() checks isStopRequested at each
   stage boundary (plan→draft, draft→compose) and inside the parallel draft
   fan-out — so cancel actually halts the expensive LLM calls, not just the
   SSE stream. Throws ResearchCancelledException (caught locally, no error
   broadcast).

2. Sticky CANCELLED terminal. complete()/fail() now no-op on a CANCELLED
   session, so a user who cancelled never sees a COMPLETED report surface
   via /status.

3. Session registry TTL. Terminal sessions get an updatedAt timestamp and
   are evicted by a @Scheduled sweep after
   mate.kbopen.research.session-ttl (default 30m). RUNNING sessions are
   never evicted. Prevents unbounded memory growth.

4. Per-key concurrency cap. startIfAllowed() rejects new research when a
   key already has mate.kbopen.research.max-concurrent-per-key (default 3)
   RUNNING sessions → 429. Stops one key from spawning ~60 parallel
   multi-step LLM pipelines per minute under the per-min rate limiter.

5. Inline FQN → import (controller LinkedHashMap, test List.of).

Nits (inherited from P0-A rebase):
- V162→V164, prefix VARCHAR(12), design doc moved to rfcs/.
- Design doc: kb:search scope row now documents it covers /research/**.

31 tests pass (12 registry incl. sticky-cancel/concurrency/TTL +
13 service + 4 rate limiter + 4 controller + ...).
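
The TTL eviction from item 3 might be sketched like this, without Spring. The class name and the explicit `now` parameter are assumptions for testability; in the PR the sweep is driven by a scheduled method and the TTL comes from mate.kbopen.research.session-ttl.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TtlSweep {
    enum Status { RUNNING, COMPLETED, FAILED, CANCELLED }
    record Session(Status status, Instant updatedAt) {}

    private final Map<String, Session> sessions = new ConcurrentHashMap<>();
    private final Duration ttl;

    TtlSweep(Duration ttl) { this.ttl = ttl; }

    void put(String id, Status status, Instant updatedAt) {
        sessions.put(id, new Session(status, updatedAt));
    }

    boolean contains(String id) { return sessions.containsKey(id); }

    /** Evicts terminal sessions older than the TTL; returns how many were removed. */
    int sweep(Instant now) {
        Instant cutoff = now.minus(ttl);
        int before = sessions.size();
        // RUNNING sessions are kept regardless of age.
        sessions.values().removeIf(s ->
                s.status() != Status.RUNNING && s.updatedAt().isBefore(cutoff));
        return before - sessions.size();
    }
}
```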
ncw1992120 force-pushed the feat/kb-open-research branch from 8d7894f to 5e216c9 on June 28, 2026 19:24
@ncw1992120

Contributor Author

All 4 blockers + nits fixed. Rebased onto P0-A (inherits V164 + FQN + design doc move) and addressed each item in commit 5e216c9.

1. Cancel now actually stops the job (cooperative).
cancel calls streamTracker.requestStop(sessionId). WikiResearchService gained ensureNotCancelled(sessionId) which checks isStopRequested at three points: after plan, after draft (before compose), and inside the parallel draftStage fan-out before each draftOneSection LLM call. On cancel it throws ResearchCancelledException (caught locally — no error broadcast, since the user asked for it). So the expensive retrieve/draft/compose LLM calls are skipped, not run to completion.

2. CANCELLED is now a sticky terminal.
complete() / fail() no-op when status == CANCELLED. A late completion from the async thread cannot overwrite the cancel — /status keeps showing cancelled.

3. Session registry TTL eviction.
Sessions carry an updatedAt timestamp; a @Scheduled sweep (every 5 min) evicts terminal sessions older than mate.kbopen.research.session-ttl (default 30m). RUNNING sessions are never evicted. No more unbounded map growth.

4. Per-key concurrency cap.
startIfAllowed(keyId, ...) counts RUNNING sessions per key and throws TooManyConcurrentException at mate.kbopen.research.max-concurrent-per-key (default 3) → controller maps to 429. This sits on top of the per-minute rate limiter, so one key cannot spawn ~60 parallel multi-step pipelines. (Per-key token billing via TokenUsageService is still P1 — flagged in the design doc.)

5. Inline FQN → import.
Controller: java.util.LinkedHashMap → import java.util.LinkedHashMap. Test: java.util.List.of → import java.util.List (both also inherit the V164/VARCHAR(12) fixes via the rebase).

Nits: V162→V164 comment, design doc moved to rfcs/, and the kb:search scope row in the doc now documents that it covers /research/** (Deep Research reuses the search scope).

31 tests pass: 12 registry (incl. sticky-cancel, per-key cap, TTL eviction) + 13 service + 4 rate limiter + 4 controller + fallback. Ready for re-review 🙏
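
For reviewers, items 2 and 4 can be pictured together in a small registry sketch. This is not the actual KbResearchSessionRegistry: the class name, the boolean return in place of TooManyConcurrentException, and the rule that only RUNNING sessions may change state (slightly stricter than "no-op when CANCELLED") are all simplifying assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class StickyRegistry {
    enum Status { RUNNING, COMPLETED, FAILED, CANCELLED }
    record Session(String keyId, Status status, String report) {}

    private final Map<String, Session> sessions = new ConcurrentHashMap<>();
    private final int maxPerKey;

    StickyRegistry(int maxPerKey) { this.maxPerKey = maxPerKey; }

    /** Returns false (the caller would answer 429) when the key is at its RUNNING cap. */
    synchronized boolean startIfAllowed(String keyId, String sessionId) {
        long running = sessions.values().stream()
                .filter(s -> s.keyId().equals(keyId) && s.status() == Status.RUNNING)
                .count();
        if (running >= maxPerKey) return false;
        sessions.put(sessionId, new Session(keyId, Status.RUNNING, null));
        return true;
    }

    void cancel(String sessionId) { transition(sessionId, Status.CANCELLED, null); }

    void complete(String sessionId, String report) {
        transition(sessionId, Status.COMPLETED, report);
    }

    /** Only RUNNING sessions change state, so every terminal state is sticky. */
    private void transition(String sessionId, Status next, String report) {
        sessions.computeIfPresent(sessionId, (id, s) ->
                s.status() == Status.RUNNING ? new Session(s.keyId(), next, report) : s);
    }

    Status status(String sessionId) {
        Session s = sessions.get(sessionId);
        return s == null ? null : s.status();
    }
}
```

Because the guard lives inside `computeIfPresent`, a late `complete` from the worker thread and a `cancel` from the request thread cannot interleave on the same session.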

@ncw1992120

Contributor Author

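A follow-up on the rebase plan for #446:

The 4 review blockers + nits were all fixed in the previous commit 5e216c9 (cooperative cancellation, sticky CANCELLED terminal state, TTL cleanup, per-key concurrency cap, FQN imports); 31 tests pass.

However, the branch is still built on the old P0-A stack. As you suggested, once #445 (P0-B) is merged into dev I will rebase #446 onto dev as well (again using --onto to drop the old P0-A/P0-B commits), at which point the branch will be mergeable again. Let me know once #445 is merged and I will rebase right away 🙏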


Development

Successfully merging this pull request may close these issues.

feat(kb-open): P1.5 Deep Research open API (standalone design)
