Skip to content

fix: --version/--help flags + startup timing observability for #29 reconnect diagnostics#30

Open
kiki830621 wants to merge 3 commits intomainfrom
idd/29-bug-mcp-failed-to-reconnect-to-plugin-ch
Open

fix: --version/--help flags + startup timing observability for #29 reconnect diagnostics#30
kiki830621 wants to merge 3 commits intomainfrom
idd/29-bug-mcp-failed-to-reconnect-to-plugin-ch

Conversation

@kiki830621
Copy link
Copy Markdown
Contributor

Refs #29

Summary

/mcp 顯示 Failed to reconnect to plugin:che-telegram-mcp:telegram-all 的根因是 binary cold start ≥10s,觸發 Claude Code 的 reconnect timeout。本 PR 在 che-msg 範圍內提供:

  • Fast health-check entry point:--version / -v exit 0.008s warm,完全不進 TDLib init,讓 wrapper 與監控可低成本確認 binary 存活
  • Failure-path observability:env-gated CHE_TELEGRAM_LOG_STARTUP=1[startup] phase=N.NNNs 至 stderr,含 tdlib_init_FAILED / total_FAILED 路徑(verify F2 修復)
  • Explicit help:--help / -h 列出 stdio JSON-RPC 為 primary mode + env var 文件

Honest constraint(已在 issue + diagnosis 揭露):本 PR 提供觀測手段 + health-check 入口。直接修復 reconnect timeout — 那需要(a) wrapper repo 的 fast-path(psychquant-claude-plugins,sister concern)+(b) Claude Code reconnect timeout review(out of repo)+ 可能(c) binary cold-start 加速(物理限制 — 233MB universal Mach-O dynamic linking)。

Real production leverage:wrapper 安裝後可改用 $BINARY --version 替代 GitHub API curl 健檢。

Changes

Commit Scope
fe0013b S1+S2+S4: --version / -v / --help / -h flags + dispatcher + 10 unit tests
3856b3f S3: env-gated [startup] timing log in CheTelegramAllMCPServer.init
c384f6d verify-29 fix round: F1(subprocess test)+ F2(failure-path observability)+ F4(env predicate unit test)

Files(origin/main..HEAD,4 files / +353 / -7):

  • Sources/CheTelegramAllMCP/main.swift — argv switch dispatcher
  • Sources/CheTelegramAllMCPCore/CLIBootstrap.swift (NEW)CLIAction enum + CLIBootstrap parser + helpText + version constant
  • Sources/CheTelegramAllMCPCore/Server.swift — env-gated startup timing(do/catch for failure-path),shouldLogStartup testable predicate
  • Tests/CheTelegramAllMCPTests/CLIBootstrapTests.swift (NEW) — 19 tests(10 parser + 6 env predicate + 3 subprocess)

Verification

6-AI cross-model verification PASS(3 specialized Claude reviewers via pr-review-toolkit + Codex GPT-5.5 xhigh,共 4 cross-checks across 2 model families)。詳見 issue #29Verify Report

Verdict 0 P0 blocking · 3/4 P1 fixed in c384f6d · 1/4 P1 deferred(F3,reviewer 意見分歧)· 多個 P2/P3 documented as follow-ups
Tests 210 passed / 11 skipped(E2E credentials缺)/ 0 failures
End-to-end --version 0.008s warm(原 ≥10s)· env-gated startup log 雙向驗證(env=1 → 4 lines stderr · env unset → silent)

Checklist

  • Diagnose ✓ (comment)
  • Implementation Plan ✓ (comment)
  • Implement(3 commits)✓ (comment)
  • Verify ✓ (comment)
  • Pending: human review of this PR + /idd-close #29 after merge

Related / Follow-ups(documented in verify report,not auto-filed)

  • F3--versoin typo silent fallthrough(--prefix unknown flag rejection — API design choice)→ che-msg follow-up
  • F8 — helpText line-position assertion(test polish)
  • F9 — version literal unification(CLIBootstrap.versionServer.swift:46 ↔ wrapper DESIRED_VERSION)
  • Sister concerns from diagnosis(屬不同 repo):
    • psychquant-claude-plugins:wrapper version-match fast-path
    • psychquant-claude-plugins:wrapper stderr log file
    • Claude Code:reconnect timeout review
    • che-msg:TDLib lock detection

🤖 Generated by /idd-all Phase 5(PR mode unattended)。Do NOT add Closes #29 — IDD discipline requires manual /idd-close after merge to enforce checklist gate + closing summary。

kiki830621 added 3 commits May 8, 2026 07:15
Diagnostic finding (#29): /mcp reconnect failures correlate with the binary's
≥10s cold-start latency. Health-check callers (Claude Code's reconnect probe,
wrapper version checks, /mcp scan) currently pay full TDLib framework load
even when they only need the version string.

Solution scoped to che-msg: extract argv parsing into a pure CLIBootstrap
struct in CheTelegramAllMCPCore, dispatch in main.swift BEFORE any TDLib /
MCP server init. --version / -v now exit in 0.008s warm / 0.976s cold,
versus ≥10s for the original full-init path.

Caveats (per diagnosis): this fixes the health-check entry point, NOT the
underlying reconnect timeout. Wrapper-side fast-path (skip GitHub API curl
when version matches) and Claude Code reconnect timeout review remain as
sister concerns out of this repo's scope.

Changes:
- New: Sources/CheTelegramAllMCPCore/CLIBootstrap.swift — CLIAction enum
  (.showVersion / .showHelp / .runServer) + parse(_:) pure function
  + version constant + helpText (mentions stdio JSON-RPC primary mode +
  CHE_TELEGRAM_LOG_STARTUP env var for upcoming S3)
- Modified: Sources/CheTelegramAllMCP/main.swift — switch on
  CLIBootstrap.parse(CommandLine.arguments) before existing server flow
- New: Tests/CheTelegramAllMCPTests/CLIBootstrapTests.swift — 10 unit tests
  covering long/short flag forms, no-args / unknown-args fallthrough, empty
  argv defensiveness, and helpText content invariants (version constant +
  stdio mention + env var name)

Out of scope (will revisit in S3): startup timing log to stderr.
When CHE_TELEGRAM_LOG_STARTUP=1 is set, CheTelegramAllMCPServer.init emits
four `[startup]` lines to stderr breaking down its cold-start latency:

  [startup] tdlib_init=N.NNNs       # try await TDLibClient()
  [startup] tools_define=N.NNNs     # Self.defineTools()
  [startup] register_handlers=N.NNNs# await registerHandlers()
  [startup] total=N.NNNs            # whole init() block

Default behavior unchanged (env unset → silent). stderr is the correct
channel — stdout is reserved for MCP JSON-RPC traffic and any output there
would corrupt the protocol.

Why this is observability not fix (per #29 diagnosis): the reconnect failure
manifests as Claude Code timing out before binary's JSON-RPC `initialize`
response arrives (~10s on cold start in production). Without per-phase
breakdown we cannot tell if TDLib framework load is the bottleneck, or
something earlier (dynamic linker page-in of the 233MB universal binary).
This log gives whoever investigates next the data they need.

DispatchTime.now().uptimeNanoseconds chosen over Swift 6 ContinuousClock for:
  - simpler API (UInt64 subtraction vs. Duration.components)
  - no dependency on Swift 6 standard library refinements
  - well-defined wraparound semantics via &- (saturating subtract)

Helper logStartupDuration is file-scope (not method) to avoid leaking into
CheTelegramAllMCPServer's public API surface — it's a private diagnostic
detail.
…patcher coverage (#29)

verify-29 round (Agent Team + Codex GPT-5.5) found 4 P1 findings, 3 of
which are addressed here (1 deferred as user-visible CLI behavior change
requiring more deliberation):

F1 (pr-test-analyzer P1 + Codex P1) — dispatcher in main.swift was
untested. Pure parser unit tests don't catch a typo like swapping
print(version) for print(help), or exit(0) → exit(1). Added 3 subprocess
tests using Process API + Bundle(for:) path resolution to spawn the
built binary and assert exit code + stdout for --version, -v, --help.
Tests skip cleanly when binary not built. ~100ms per test, deterministic
(does not enter TDLib path).

F2 (silent-failure-hunter P1 + Codex P2) — TDLib init throw silenced the
[startup] tdlib_init line. The user reaching for CHE_TELEGRAM_LOG_STARTUP=1
is exactly the user diagnosing slow init that may also throw — failure
path observability is the higher-stakes case. Wrapped tdlib_init in
do/catch that emits tdlib_init_FAILED + total_FAILED before re-throwing.
tools_define and register_handlers don't throw so they keep simple
guarded form.

F4 (pr-test-analyzer P1) — env-gating predicate (== "1" check) was the
production safety mechanism but lived inline in init, not unit-testable.
Extracted as internal func shouldLogStartup(env:) accepting an env dict
explicitly. Added 6 unit tests pinning strict "1" contract: true on
"1", false on unset / "" / "0" / "true" / "TRUE". A typo like
== "0" would now be caught at PR time, not after merge floods stderr.

F3 (silent-failure-hunter P1, Codex did not flag) — DEFERRED. Reviewer
proposed adding .unknownFlag(String) enum case to reject typos like
--versoin. Disagreement between reviewers (code-reviewer accepted current
fallthrough as documented behavior). Deferring to follow-up because:
(a) it changes user-visible CLI behavior (existing harness scripts may
pass benign --debug flags expecting them to be ignored), (b) requires
choice on whether non-flag positionals (paths) should also be rejected.
Tracking inline in verify report.

Tests: 210 passed / 11 skipped (E2E creds) / 0 failures, up from 201.
- 6 new shouldLogStartup unit tests
- 3 new subprocess integration tests for --version/-v/--help

logStartupDuration visibility changed from private to internal so tests
in CheTelegramAllMCPTests target can reach shouldLogStartup helper from
the same file (Swift internal == module-wide).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant