feat(auth): oauth install/callback handshake + b24oauth factory (PR-2c) by IgorShevchik · Pull Request #216 · bitrix24/templates-mcp

IgorShevchik · 2026-06-05T12:47:31Z

Summary

PR-2c lands the user-visible half of the OAuth multi-tenant flow:

/api/oauth/install — portal allow-list, state nonce, CSRF cookie, redirect to Bitrix24 authorize.
/api/oauth/callback — state verification (4 §8 CSRF bindings), code-to-token exchange, Bearer mint, HTML page that shows the Bearer once.
/api/oauth/_health — operator-tier counts endpoint with fails-closed authentication (loopback 127.0.0.0/8 OR NUXT_BITRIX24_OAUTH_ADMIN_TOKEN, never the agent token).
useBitrix24OAuth(memberId, userId) — per-tenant B24OAuth factory with LRU(100) caching + refresh-on-expiry via setCustomRefreshAuth + markRefreshFailed on invalid_grant.
Dispatcher (useBitrix24Tenant()) now routes to the factory when the flag is on; tools that PR-2d migrated need ZERO further code changes.
pruneExpiredStates scheduler wired into the Nitro schema-bootstrap plugin (closes PR-2c: wire pruneExpiredStates scheduler (5-min interval) #211).
Logger-redactor extended with OAuth secret patterns — landed FIRST per OAUTH-DESIGN.md §11 ordering invariant, so no OAuth log line can leak.
getRequestId(): string throwing accessor (closes PR-2c: requestId — populate unconditionally + getRequestId() throwing helper #214) — forward-compat anchor for the Bearer middleware that completes the wire-up.

Default-off-safe end-to-end: with NUXT_BITRIX24_OAUTH_ENABLED=false (the production default) the new endpoints respond with 503 FLAG-OFF, no /data directory is touched, the existing webhook flow is byte-identical.

What's in

Logger-redactor OAuth extension → getRequestId() accessor → install route → _health + scheduler → callback route → B24OAuth factory → dispatcher wiring → manual-QA scripts. Plus four review rounds of fixes (CI green-up, then security/correctness hardening). See the commit history; on squash-merge this collapses to one.

652 unit tests passed (+113 vs main), typecheck + eslint + shellcheck + zizmor clean.

Manual QA — run these scripts before merging

pnpm dev   # or docker compose up
./scripts/manual-qa-pr2c.sh        # Linux / macOS / WSL / Git-Bash
.\scripts\manual-qa-pr2c.ps1       # Windows native PowerShell

Auto-detects the running scenario and prints PASS/FAIL per assertion:

Scenario A — NUXT_BITRIX24_OAUTH_ENABLED=false (default): /api/health still 200, OAuth endpoints refuse with FLAG-OFF. The merge-readiness smoke — proves byte-identical-to-webhook.
Scenario B — flag on + configured: portal allow-list rejects junk, install → 302 + CSRF cookie (HttpOnly + SameSite=Lax), callback gates (PARAMS-MISSING / STATE-MISSING / STATE-EXPIRED), _health 200 from localhost.
Scenario C — flag on but CLIENT_ID/REDIRECT_URL missing: names the missing variable.

Full Bitrix24 round-trip (real Marketplace app + portal + user) is out of script scope — covered by callback.test.ts which mocks the token exchange via vi.stubGlobal('fetch', ...).

What's NOT in (deferred, all tracked)

Item	Why deferred	Tracked
Bearer middleware (`Bearer → findByBearerHash → runWithTenant`) on `/mcp`	Needs `defineMcpHandler({ middleware })` integration — its own review round. The last wire that makes a minted Bearer authenticate.	#217
Migrate ~30 tool tests to `vi.mock('~/server/utils/bitrix24-tenant')`	Mechanical, deserves its own PR	#215
Final operator docs (README OAuth-first, `docs/DEPLOYMENT.md`)	Batched	PR-5

Without the Bearer middleware, /mcp still uses NUXT_MCP_AUTH_TOKEN. An operator can flip the flag, complete /install → /callback, and get a Bearer — but /mcp doesn't accept it until #217. The callback HTML page warns the user about this explicitly ("⚠ Not active yet").

Refs / closes

✅ PR docs(auth): add oauth 2.0 + multi-tenant design doc (draft, no code) #58 (design), feat(auth): wire oauth scaffolding — ALS, tenant dispatcher, env (PR-2a) #209 (PR-2a scaffold), feat(security): oauth token store — sqlite-backed CRUD + audit-first invariant (PR-2b) #210 (PR-2b token store), refactor(auth): swap tool catalogue to useBitrix24Tenant + §11 observability contract (PR-2d) #213 (PR-2d tool swap) — all in main.
Closes PR-2c: wire pruneExpiredStates scheduler (5-min interval) #211 (pruneExpiredStates scheduler), PR-2c: requestId — populate unconditionally + getRequestId() throwing helper #214 (getRequestId accessor).
Drives oauth(P1-pre-enterprise): JSONL audit log for every token mutation (oauth_tokens, mcp_tokens) #61 (OAuth audit-log call sites — now emitted through real handlers).
Follow-ups: PR-2c-bearer: wire OAuth Bearer middleware into /mcp (the last connection) #217 (Bearer middleware — load-bearing), Migrate ~30 tool tests to mock useBitrix24Tenant (not useBitrix24) #215 (test migration).

Changelog framing (no BREAKING CHANGE)

Feature → Auth: OAuth install / callback / health endpoints + per-tenant B24OAuth factory with refresh-on-expiry, all behind a default-off flag.
Security: logger-redactor scrubs OAuth secrets (URL + JSON-body shapes); /api/oauth/_health fails closed without operator-tier auth, reads the raw socket IP (no X-Forwarded-For spoof of the localhost gate), and never exposes the DB path; CSRF cookie compared in constant time; Bitrix24 member_id regex-validated before it becomes a SQLite key; callback HTML escapes all interpolated values.
Test: scripts/manual-qa-pr2c.{sh,ps1} for operator pre-merge verification.

§11 of docs/OAUTH-DESIGN.md mandates this commit lands FIRST inside PR-2c — before any oauth.* handler emits a structured log line. Reversing the order means a reviewer can be asked to accept a `logger.info('oauth.callback.ok', { url })` call while the redactor still passes `?code=…` through unchanged. The ordering removes that class of mistake by making the redactor load-bearing at the first caller. What this commit adds (and nothing else): - `OAUTH_URL_RE` — catches `?code=…`, `?refresh_token=…`, `?access_token=…`, `?client_secret=…` in URL query-string position. Preserves the key name (operator forensics) and replaces only the value with `<REDACTED>`. - `OAUTH_JSON_RE` — catches `"access_token":"…"`, `"refresh_token":"…"`, `"client_secret":"…"` in JSON-stringified payloads. Closes the fixture-shape-4 leak in §11 (`logger.error('…', { body: JSON.stringify(res) })`). The OAuth authorization `code` is intentionally OMITTED from the JSON regex — `"code"` is too common a key name (HTTP status, error codes) and would over-redact; the URL-shape `?code=` is the only realistic leak surface for that parameter. - `client_secret` added to `SENSITIVE_KEYS` so direct-key field access (`{ client_id, client_secret }`) also redacts via the object-walker path. Coverage — 12 new fixtures in `tests/unit/utils/logger-redactor.oauth.test.ts`: - Fixture shape 1: `{ url: '…?code=…' }`. - Fixture shape 2: `{ redirectUrl: '…?refresh_token=…' }`. - Fixture shape 3: template literal carrying `err.message` with URL. - Fixture shape 4: `{ body: JSON.stringify(response) }`. - Plus: defence-in-depth `?access_token=` strays, pretty-printed JSON (whitespace between `:` and value), the "no over-redaction" pin for generic `{"code": "…"}` shapes, and the end-to-end test that drives both shapes through `makeRedactingLogger` in one call. Zero OAuth-logging callers introduced in this commit. Tests pass against the existing webhook-only suite (566 total, +12 new), and the extension is byte-identical for webhook-only forks — `WEBHOOK_URL_RE` is unchanged and runs first.

) Round-3 review on PR-2d flagged `requestId?: string` as forward-compat that invited "forgot-to-set" bugs — issue #214's resolution was to add a strict accessor that throws loudly the moment the middleware fails to populate the field, instead of returning `undefined` and letting downstream `jq` queries silently miss correlation ids. The accessor: ```ts export function getRequestId(): string { const ctx = tenantContext.getStore() if (!ctx?.requestId) { throw new Error( 'getRequestId() called outside a runWithTenant scope, or runWithTenant was ' + 'called without a requestId. The MCP middleware (PR-2c) must wrap every ' + 'request with `runWithTenant({memberId, userId, requestId}, …)`. ' + 'See docs/OAUTH-DESIGN.md §11 and issue #214.', ) } return ctx.requestId } ``` Two failure modes covered by one throw — both are "middleware not wired" bugs, neither is a runtime branch: 1. `getRequestId()` called outside any `runWithTenant` (no ALS scope). 2. `runWithTenant` was called with just `{memberId, userId}` (a partial context that PR-2a's tests still use; production middleware in PR-2c sets the field unconditionally). The PR-2d-introduced `requestId?: string` optional field stays as-is in the interface — tests that construct partial contexts keep compiling. Production code reads via `getRequestId()`, not via the raw field, so a missing value becomes a loud error at the first logger.info call (the helper is invoked call. Tests: +4 cases in `tests/unit/utils/request-context.test.ts`: - Happy path returns the field value. - Throws when called with no `runWithTenant` wrap. - Throws when wrap omits `requestId`. - Error message contains the operator-actionable hints (`MCP middleware (PR-2c)`, `§11`, `#214`) so a stack-trace-only reader can trace back to the design doc + issue. Zero callers introduced in this commit — `oauth.*` and `mcp.auth.*` emitters that use the helper land in subsequent PR-2c commits. Refs: #214.

Marketplace-app install entry point — operator pastes `https://<mcp-host>/api/oauth/install?portal=<portal>` into their README; each user who clicks lands here, gets a state nonce minted + CSRF cookie set, and is 302-redirected to their portal's authorize page. After the user approves, Bitrix24 redirects back to /callback (PR-2c step 5). Gates (all fail loud with errorCode in `data`): - 503 FLAG-OFF — NUXT_BITRIX24_OAUTH_ENABLED=false. The install link should not be reachable on a webhook-only deploy or after a rollback (§10). - 503 NOT-CONFIGURED — flag is on but CLIENT_ID or REDIRECT_URL is empty (operator partially-configured). Names the missing variable in the structured log line. - 400 PORTAL-FORMAT — `?portal=` failed the allow-list regex from §8 #3: `^[a-z0-9-]+\.bitrix24\.(com|ru|eu|de|by|kz|ua)$`. Defends against open-redirector abuse (any host smuggled in via `?portal=` would otherwise end up in the 302 Location). Happy path: - Mints two independent 32-byte hex nonces — `state` (persisted in oauth_state with 5-min TTL) and `csrfCookie` (set as a first-party cookie). The cookie/state pair is what /callback will verify. - Cookie attributes: `HttpOnly; Secure; SameSite=Lax; Path=/api/oauth/`. Lax allows the top-level navigation back from Bitrix24's authorize page; Path-scoping keeps the cookie out of `/mcp` and tool routes. - 302 to `https://<portal>/oauth/authorize/` with `client_id`, `state`, `redirect_uri`, `scope`, `response_type=code`. Logging (§11 taxonomy): - `oauth.install.start` (INFO, on entry — portal + clientId visible) - `oauth.install.deny.flag-off` (WARN) - `oauth.install.deny.not-configured` (ERROR — operator misconfig) - `oauth.install.deny.portal-format` (WARN — caller error / probe) - `oauth.install.ok` (INFO — `statePrefix` is the first 8 hex chars of the nonce only; the full value is a secret bound to the CSRF check, per §11 debug-trace policy) The structured logger goes through the redactor extended in step 1, so any future code path that accidentally feeds an OAuth URL into the context (e.g. via a wrapped error) gets scrubbed before it reaches the sink. Tests (`tests/unit/api/oauth/install.test.ts`, 23 cases): - Each gate path → expected statusCode + errorCode in body. - 7 happy-portal hostnames accepted; 7 invalid (empty, non-bitrix24, javascript scheme, unlisted TLD `bitrix24.us`, no subdomain, path/port injection) → 400. - 302 to authorize URL with all expected query params. - oauth_state row persisted with portal + clientId + 5-min TTL. - Cookie set with the documented attributes + Max-Age=300. - Cookie value MATCHES the persisted csrfCookie (the pairing the callback will verify — pinned so a future split doesn't break it). - Full state nonce NEVER appears in any logged context — only the 8-char `statePrefix`. - Two sequential calls produce distinct state + cookie nonces (no reuse). The tests drive a real h3 `createApp` + `toNodeListener` so `setCookie`, `sendRedirect`, and `createError` go through the actual h3 code paths — no shortcuts. No other code paths touch this route yet. Callback handler lands in the next commit; the Bearer middleware that consumes `mcp_tokens` lands after that.

…step 4-5) Closes #211 in spirit (the scheduler the §11 contract demanded for the oauth_state table); the health endpoint per §11 lands together so the operator-tier observability surface arrives in one piece. ## /api/oauth/_health (step 4 of PR-2c) Operator-tier counts endpoint — no PII, no tokens, no portal hosts. Returns: { enabled: true, dbPath: '/data/oauth.sqlite', tenants: 12, // count of oauth_tokens rows bearers: 47, // count of active mcp_tokens (revoked_at IS NULL) pendingStates: 3, // count of oauth_state rows with expires_at > now lastRefreshOk: null, // stub — PR-2c step 6 (B24OAuth factory) populates lastRefreshFail: null, // ditto } Three new prepared statements on TokenStore + a `getHealthCounts()` verb. All three queries run in one synchronous bundle (`better-sqlite3` is sync) so the endpoint is a single round-trip. Authentication — fails closed (§11 mandate). Two acceptable patterns: 1. Network isolation: request from 127.0.0.1 / ::1 (nginx allow/deny upstream). No application-level token; ops infra owns access. The recommended pattern for the reference docker-compose setup. 2. Dedicated `NUXT_BITRIX24_OAUTH_ADMIN_TOKEN` env var. Constant-time compared against `Authorization: Bearer`. Use this if network isolation is infeasible (shared single-host setups). Failure modes (all loud, with errorCode in the JSON body): - 503 FLAG-OFF — NUXT_BITRIX24_OAUTH_ENABLED=false. - 503 NOT-CONFIGURED — neither admin token nor localhost. Default config ships closed. - 401 ADMIN-TOKEN-MISSING — token configured, no Bearer header. - 401 ADMIN-TOKEN-INVALID — Bearer present but wrong. NEVER falls back to `NUXT_MCP_AUTH_TOKEN` (the agent's token) — pinned by a dedicated test. Privilege separation: a compromised agent (prompt injection, jailbroken DXT) must not read fleet-level counts. Once the operator opts in to token-based auth, the route is uniformly token-gated even from localhost — no implicit bypass. A dev box must include the Bearer too. ## pruneExpiredStates scheduler (step 5 of PR-2c, issue #211) `oauth_state` rows live 5 minutes (the install→callback TTL). Rows that never get consumed (browser closed, link not clicked, `/install` spam) accumulate without cleanup. The schema-bootstrap plugin (`server/plugins/oauth-schema.ts`) now arms a 5-minute setInterval: - First tick at +5 min, NOT on boot. - Logs `oauth.state.prune.ok` ONLY when rows > 0 (no log spam on empty buckets). - Logs `oauth.state.prune.fail` on transient SQLite errors (disk full, lock contention) and CONTINUES — next tick retries. - Cleared via Nitro's `close` hook so SIGTERM / dev hot-reload don't leave a zombie timer. - `timer.unref()` so a hanging timer doesn't keep the process alive if the close hook didn't fire (defensive). Flag-off path: zero behaviour change. No timer armed, no close hook registered, no `useTokenStore()` call. `PRUNE_INTERVAL_MS` exported as `@internal` for testability. ## .env.example + nuxt.config.ts Added `bitrix24OauthAdminToken: ''` to runtimeConfig (NUXT_BITRIX24_OAUTH_ADMIN_TOKEN). Documented in the JSDoc that this token MUST NOT fall back to mcpAuthToken — privilege levels differ. ## Tests - `tests/unit/api/oauth/_health.test.ts` (14 cases): each gating branch + counts shape + dbPath derivation + log-payload assertion + the MCP_AUTH_TOKEN privilege-separation invariant + the localhost+token "token wins" behaviour. - `tests/unit/utils/oauth-schema.test.ts` (+5 cases on top of the existing 3): timer arms at +5 min, logs on >0 prunes, silent on 0 prunes, survives a prune throw, close-hook clears the timer, flag-off arms NOTHING. Total: 612 passed, 1 skipped, 2 todo. Typecheck + eslint clean. Refs: #211, §11.

The user-visible half of the install→callback handshake. Bitrix24 redirects the user here after the authorize page; this handler verifies the state nonce, exchanges the authorization code for tokens, persists the tenant + mints a Bearer, and renders an HTML page that shows the Bearer ONCE with paste instructions. Verification chain (each failure → distinct errorCode + 4xx): - PARAMS-MISSING — code or state absent from the URL. - STATE-MISSING — state not in oauth_state OR expired. Single atomic `DELETE … RETURNING` (PR-2b) — re-using a state is a hard no. - STATE-COOKIE-MISMATCH — CSRF cookie doesn't match the persisted csrfCookie (§8 #2). - STATE-PORTAL-MISMATCH — Bitrix24's callback `?domain=` disagrees with the install portal (replay across portals defence). - STATE-CLIENT-MISMATCH — clientId rotated between install + callback (config drift defence). Token exchange (POST to `https://oauth.bitrix24.tech/oauth/token/`): - EXCHANGE-NETWORK (502) — fetch rejected (DNS / ECONNREFUSED). - EXCHANGE-NON-JSON (502) — response body wasn't JSON (upstream maintenance page). - EXCHANGE-FAIL (502) — Bitrix24 returned `{ error: ... }` or non-2xx status. `invalid_grant` is the canonical "code reused / expired" path. - EXCHANGE-BAD-USER-ID (502) — Bitrix24 returned a non-numeric user_id (defence in depth). The `client_secret` is sent in the POST body (form-urlencoded) so it never appears in any URL-shaped log line even by accident. The redactor extended in step 1 catches OAuth secrets in BOTH URL-query and JSON-body shapes — so even an `err.message` carrying a fetched URL or a logged response body gets scrubbed. Happy path: 1. `upsertTokens` writes oauth_tokens (audit-first via the PR-2b invariant — recordAuditEvent BEFORE the SQLite write). 2. `createMcpToken` mints a Bearer for the (member_id, user_id) pair. The persisted portal hostname is used as the default label so the PR-2c-pending "list my Bearers" operator tool (issue #212) can identify it without the user naming it. 3. The CSRF cookie is cleared (`deleteCookie` emits `Max-Age=0`). 4. Response headers: `Cache-Control: no-store, no-cache`, `Pragma: no-cache`, `Content-Type: text/html; charset=utf-8`. The Bearer is embedded in a `<pre>` block, never in a URL. 5. HTML page is intentionally JS-free (no external script-src) — the operator pastes the token manually. Logging (§11 taxonomy): - oauth.callback.start (INFO — statePrefix 8 chars only) - oauth.callback.deny.<reason> (WARN/ERROR) - oauth.callback.exchange.fail (ERROR — Bitrix24 error code, NOT error_description or raw URL/body) - oauth.callback.exchange.ok (INFO — memberId, userId, bearerHashPrefix 'sha256-'+8 hex, portal; NEVER the raw Bearer) Tests (`tests/unit/api/oauth/callback.test.ts`, 16 cases): - Each verification branch with a distinct seed scenario. - Each exchange failure mode with a corresponding `fetchMock` shape (network reject, invalid_grant, 5xx, non-JSON, bad-user-id). - State is consumed atomically — second callback with same state → STATE-MISSING. - Happy path: oauth_tokens row landed with correct fields, mcp_tokens Bearer minted, the round-trip `hashBearer(raw) === findByBearerHash` invariant holds. - The raw Bearer NEVER appears in any logged context (full set scan). - The POST body shape is pinned: grant_type, client_id, client_secret, code, redirect_uri. Fetch is mocked via `vi.stubGlobal('fetch', vi.fn())` instead of `msw` — one fetch site, no need for the dev-dependency surface. Total: 628 passed (+16 callback). Typecheck + eslint clean. Refs: §3, §8, §11.

`useBitrix24OAuth(memberId, userId): B24OAuth` is the per-tenant client factory `docs/OAUTH-DESIGN.md §5 + §7` promised. Used by `useBitrix24Tenant()` when `NUXT_BITRIX24_OAUTH_ENABLED=true` and a tenant context is bound; tools never call it directly. ## Caching strategy - In-memory LRU keyed on `${memberId}:${userId}`, capped at 100 entries (§5 sizing — covers the realistic per-process working set). Map-based with promote-on-access (delete + re-set preserves insertion order which IS the LRU order). - Synchronous construction. JavaScript's single-threaded event loop means two callers cannot be mid-init for the same key simultaneously, so no mutex is needed; the first caller's `lruSet` settles before the next caller's `lruGet` runs. The §5 "per-tenant mutex" narrative applies only to the REFRESH path inside the SDK (`setCustomRefreshAuth` is async-callable by the SDK), and that's handled by the callback below. - Cache eviction caveat (§5 known edge case): if an entry is evicted while a refresh is in flight inside the SDK, a concurrent call after eviction creates a second instance, both refresh attempts write the same new tokens (idempotent). Eviction is logged so the LRU can be sized correctly. Keeping the factory sync also keeps `useBitrix24Tenant()` sync — which avoids touching the 20+ tool call sites that PR-2d migrated. The async-ness in §7's draft signature was design-doc-driven but isn't actually needed: B24OAuth construction is sync, and refresh is already async via the SDK's setCustomRefreshAuth contract. ## Refresh handling — `setCustomRefreshAuth` Chose `setCustomRefreshAuth` over `setCallbackRefreshAuth` so the HTTP exchange happens on OUR fetch (testable via `vi.stubGlobal('fetch', ...)`), and the error classification is direct: - `invalid_grant` → `markRefreshFailed(memberId, userId)` stamps every active Bearer `revoked_at`. Evicts the cached instance so a re-authorise lands a fresh one. The user re-authorises via /install. - Network error / 5xx → log + re-throw WITHOUT `markRefreshFailed`. Refresh failure may be transient (DNS, brief outage); Bearers stay alive. Process-local `refreshStatus` tracker captures `lastRefreshOk` / `lastRefreshFail` timestamps. Exported as `_readRefreshStatus()` for the `/api/oauth/_health` endpoint to populate the corresponding fields (currently stubbed `null` — wired into _health in the next commit). ## Logging (§11 taxonomy) - `oauth.refresh.start` (INFO) - `oauth.refresh.ok` (INFO — new accessExpiresAt) - `oauth.refresh.fail.invalid-grant` (ERROR — refresh-token death) - `oauth.refresh.fail.transient` (ERROR — network / 5xx / non-JSON) - `oauth.factory.lru.evicted` (DEBUG — eviction signal so the operator can size the LRU) The raw `refresh_token` / `access_token` / `client_secret` NEVER appear in any logged context — verified by an end-to-end log-scan test. ## Tests (tests/unit/utils/bitrix24-oauth.test.ts, 11 cases) Caching (6): - Returns a `B24OAuth` instance for an existing tenant. - Two sequential calls → same instance (identity). - N=10 concurrent calls → all share one instance (sync construction in JS's single-threaded loop = no race window). - Different tenants → different instances. - Missing tenant row → throws `oauth_tokens row missing`. - Missing CLIENT_ID → throws. Refresh (4): - HTTP POST hits `oauth.bitrix24.tech/oauth/token/` with correct `grant_type=refresh_token` body. New tokens persisted via `upsertTokens('refresh')`. `lastRefreshOk` non-null. `oauth.refresh.ok` logged. - `invalid_grant` → `markRefreshFailed` evicts active Bearers, cache evicted (subsequent call gets a fresh instance), `lastRefreshFail` non-null, `oauth.refresh.fail.invalid-grant` logged. - 5xx → re-throws WITHOUT touching Bearers (transient class), `oauth.refresh.fail.transient` logged. - Log-scan invariant: raw `refresh_token`, `access_token`, `client_secret` values NEVER appear in any logged context. Status tracker (1): - `_readRefreshStatus()` starts both fields at `null` on a fresh process. The test sniffer pattern (`Object.defineProperty` on `B24OAuth.prototype.setCustomRefreshAuth`) captures the factory's refresh callback so it can be invoked directly — equivalent to "the SDK detected expiry and called refresh" without round-tripping through the SDK's internal expiry check. Refs: §5, §7, §11.

…ep 8) `useBitrix24Tenant()`'s OAuth-on branch previously threw "OAuth path not yet implemented (lands in PR-2c)" with the tenant ids on the operator-side logger. This commit replaces the throw with the real `useBitrix24OAuth(tenant.memberId, tenant.userId)` call — the per-tenant B24OAuth factory from step 7. Tools that have been routing through the dispatcher since PR-2d (#213) now resolve to per-tenant clients when the flag is on, without a single tool-side edit. Wired changes: - `server/utils/bitrix24-tenant.ts`: - Import `useBitrix24OAuth` from `~/server/utils/bitrix24-oauth`. - OAuth-on + tenant-bound path now `return useBitrix24OAuth(memberId, userIdNum)`. - `tenant.userId` is a string in `TenantContext` (matches audit-log shape); the factory expects a number. Defensive `Number.parseInt` with a `Number.isFinite` guard that throws + logs `oauth.tenant.dispatch.bad-user-id` if a non-numeric value slips through. In practice the Bearer middleware will only construct `TenantContext` from a verified `findByBearerHash` result whose `user_id` column is INTEGER NOT NULL, so this is belt-and-braces. - JSDoc rewritten — the "not yet implemented" caveat is gone; the behaviour description now matches the wired-up reality. - The no-tenant-scope branch also logs via the structured logger now (`oauth.tenant.dispatch.no-tenant-scope`) for consistency with the bad-user-id branch. - `server/api/oauth/_health.get.ts`: - Import `_readRefreshStatus` from `~/server/utils/bitrix24-oauth`. - `lastRefreshOk` / `lastRefreshFail` now reflect the factory's process-local tracker (both `null` on a fresh process, correctly signalling "no refresh has been attempted yet"). The log payload includes the refresh status fields alongside counts so a single `oauth.health.ok` log line carries the full health snapshot. Tests: - `tests/unit/utils/bitrix24-tenant.test.ts`: - Mock `~/server/utils/bitrix24-oauth` with a sentinel client so the dispatcher's tenant-to-client resolution is controllable without seeding a real SQLite store from inside the dispatcher's unit test. - Replaced "throws not yet implemented" tests with the corresponding wired-up assertions: factory called with correct args, dispatcher returns the factory's result. - Added a non-numeric-userId guard test. - The N=10 concurrent test now pins the bijection on factory call arguments instead of error-message payloads (cross-tenant leak guard survives at the new layer). - `tests/unit/tools/users/current-user.test.ts`: - The "tenant-scope-bound branch" test now asserts on the factory's `row missing` error (the realistic failure with no oauth_tokens seed) rather than the now-removed "not yet implemented" string. Total: 640 passed (+1 dispatcher test rename, -1 factory wiring replaces the stub). Refs: §7, §11, #214.

Three CI regressions on the previous push, all fixed: 1) **Commit messages** check ran on the PR title (not the commits) and flagged 'OAuth' as pascal-case. PR title updated to 'oauth install/ callback handshake + §11 observability surface (PR-2c, WIP draft)' via the GitHub MCP tool — separately from this commit, since the title isn't a repo artefact. This commit doesn't carry the title fix; mentioned for traceability. 2) **Unit tests** — `Cannot find package 'h3'` in the three new `tests/unit/api/oauth/*.test.ts` files. h3 is a transitive Nuxt dependency, accessible during local dev (pnpm hoists it), but CI's stricter ESM resolution refuses to load a direct import that isn't declared in package.json. Added `h3` to devDependencies pinning to the version already in node_modules (1.15.11). The tests use h3's `createApp` + `toNodeListener` to drive routes through the real h3 framework end-to-end (cookies, redirects, headers); mocking h3's internals would be brittle. 3) **Unit tests** — `EACCES: permission denied, mkdir '/data'` in `tests/unit/tools/users/current-user.test.ts`. The flag-on guard test runs the dispatcher → factory → `useTokenStore()` chain, and the lazy `useTokenStore()` open tries to mkdir `/data` on the CI runner. The factory's behaviour is already exhaustively covered by `tests/unit/utils/bitrix24-oauth.test.ts`; the catalogue test only needs to assert that the catalogue surfaces the factory's loud failure. Mocked `~/server/utils/bitrix24-oauth.useBitrix24OAuth` to throw a stable `oauth_tokens row missing` error so the test asserts on shape without touching the filesystem. Local validation: - `npx vitest run` — 640 passed, 1 skipped, 2 todo. - `npx nuxt typecheck` — clean. - `npx eslint .` — clean. No production code changed; only test isolation + devDep declaration.

After the previous CI-fix push, `tests/unit/api/oauth/callback.test.ts` still failed with EACCES on `mkdir /data` from `audit-log.ts:248`. The callback handler calls `store.upsertTokens()` and `store.createMcpToken()`, both of which trigger `recordAuditEvent` (the PR-2b audit-first invariant). The real audit-log lazily creates `${NUXT_AUDIT_DIR ?? /data/audit}/` on first write — fails on a CI runner without /data write permission, and the 500 from the handler cascades into three assertion failures on the happy-path tests. Same hazard would apply to `_health.test.ts` once the "counts" test runs against an environment without /data; defensive-mock there too. The audit-first invariant is already exhaustively tested in `token-store.test.ts` via the same `vi.hoisted` + `vi.mock` pattern — copying that pattern into the two route tests gives uniform isolation without changing what the route tests actually assert (statusCode, headers, body, persisted DB state). `install.test.ts` doesn't trigger audit (only `createState` which is not audit-first), so it stays unchanged. Local: 640 passed, typecheck + eslint clean.

…ow-up) After the previous CI-fix push (403cac9), `tests/unit/utils/bitrix24-oauth.test.ts` was still leaving 7 unhandled rejections from `audit-log.ts:248` (EACCES on `mkdir /data`). The factory tests call `store.upsertTokens(SAMPLE_TENANT, 'install')` in setup + `createMcpToken` in body; both trigger `recordAuditEvent` via the PR-2b audit-first invariant. The audit-log lazily writes to `${NUXT_AUDIT_DIR ?? /data/audit}/`, which fails with EACCES on CI runners without /data write permission. Same `vi.hoisted` + `vi.mock('~/server/utils/audit-log', ...)` pattern as `token-store.test.ts`, `callback.test.ts`, `_health.test.ts`. The audit- first invariant itself is exhaustively tested in `token-store.test.ts`; the factory tests only need the DB-side effects. Last test file in this PR that exercises real `recordAuditEvent` — every other OAuth-related test either mocks audit-log directly or mocks the factory/token-store layer above it. Local: 640 passed, typecheck + eslint clean.

Two single-file run-it-and-paste-the-output scripts for the operator manually verifying the new OAuth install/callback/health surface without needing a real Bitrix24 portal: - scripts/manual-qa-pr2c.sh (bash, Linux/macOS/WSL/Git-Bash) - scripts/manual-qa-pr2c.ps1 (PowerShell, Windows native) Both detect which scenario the running server is in by probing /api/oauth/install with no params, then run the appropriate assertion set: - Scenario A (flag off, default): /api/health still 200; OAuth endpoints refuse with errorCode FLAG-OFF. This is the pre-deployment smoke — proves the OAuth diff is byte-identical to webhook-only behaviour. - Scenario B (flag on, configured): portal allow-list rejects junk; happy install → 302 + CSRF cookie with HttpOnly + SameSite=Lax; callback gates work (PARAMS-MISSING, STATE-MISSING); /api/oauth/_health gating (200 from localhost, 401 with token). - Scenario C (flag on, missing CLIENT_ID/REDIRECT_URL): the script detects + names the missing config so the operator knows where to look. Real Bitrix24 round-trip (install → authorize → callback with a real code) is OUT of scope for the scripts — that needs a registered Marketplace app + a real user. The scripts pin the gates that DON'T need it, which catches every regression class except "the actual token-exchange POST fails", and that is covered by the `tests/unit/api/oauth/callback.test.ts` fetch-mock suite. Exit code 0 if all assertions pass, 1 if any failed, 2 if the script couldn't reach the server. Usage: pnpm dev # or: docker compose up ./scripts/manual-qa-pr2c.sh # or on Windows: .\scripts\manual-qa-pr2c.ps1

ShellCheck CI flagged `exit $([ $FAIL -eq 0 ] && echo 0 || echo 1)` at line 180 (SC2046: command substitution without quotes is word-split). Replaced the cleverness with a plain if/else — clearer and lint-clean: if [ "$FAIL" -eq 0 ]; then exit 0; else exit 1; fi `bash -n` passes; the other command substitutions in the file are all assigned to variables (not word-split), so they're safe.

@throws

Five Sonnet agents (docs/programmer/tester/security/CTO) reviewed the install/callback/factory surface. Consolidated fixes, all in this PR except the Bearer-middleware tracking issue (#217, agreed split). SECURITY (Security #1, 🔴) — X-Forwarded-For localhost spoof on _health: `getRequestIP(event, { xForwardedFor: true })` trusted a client- supplied header, so `curl -H 'X-Forwarded-For: 127.0.0.1'` from the internet passed the localhost gate and read fleet counts. Switched to the raw socket IP (`getRequestIP(event)`). New test proves the spoof now gets 503. SECURITY/Programmer (🔴) — _health leaked dbPath: Removed `dbPath` from the response body — a filesystem path is infra topology that aids a post-auth attacker / a misconfigured-nginx exposure. Added `processStartedAt` instead so a dashboard can tell "lastRefreshOk null because just restarted" from "null because no refresh ever ran" (CTO #3). Docs/Programmer (🔴) — STATE-EXPIRED vs STATE-MISSING: `consumeState` folded expired + never-existed into one `undefined`, so `oauth.callback.deny.state-expired` was unemittable and a slow user looked identical to a probe in the logs. `consumeState` now returns the row regardless of expiry (still deletes it — replay protection); the callback checks `expiresAt` and emits STATE-EXPIRED (INFO) vs STATE-MISSING (WARN) with distinct errorCodes. Docs (🔴) — §11 taxonomy didn't match the code: Rewrote the §11 event taxonomy to match every emitted event (install/callback/health/refresh/dispatch), removed the phantom `portal-host` (the one regex covers format + TLD), added the `mcp.auth.deny.*` block as explicitly "(deferred)". §10 rollout table updated: PR-2c is #216, Bearer middleware split to #217. Docs (🔴) — NUXT_BITRIX24_OAUTH_ADMIN_TOKEN undocumented: Added it to `.env.example` with the privilege-separation rationale + the multi-instance scaling caveat (CTO #5). Without this an operator hit a silent 503 with no idea which variable to set. Security (🟡) — CSRF cookie compare: `!==` → constant-time `timingSafeEqual` (matches the _health admin token compare). Security (🟡) — member_id validation: Bitrix24's `member_id` is now regex-validated (`^[a-zA-Z0-9._:-]{1,128}$`) before it becomes a SQLite PK + log field — a compromised upstream can't smuggle a path-traversal / over-long string. New EXCHANGE-BAD-MEMBER-ID errorCode + test. Security (🟡) — domain binding optional: When Bitrix24 omits `?domain=`, the portal binding can't be checked; now logs `oauth.callback.domain-absent` (WARN) so an operator can spot a flow missing the binding. The other three §8 checks still hold. Programmer (🟡) — callback HTTP semantics: Bitrix24 5xx → 503 (retryable upstream outage), 4xx/explicit error → 502. Was uniformly 502. New tests for both. Programmer (🟡) — OAUTH_JSON_RE under-redaction: `[^"\\]+` stopped at a stray backslash and leaked the token tail. Changed to `[^"]+` (OAuth tokens never contain a literal quote). Corrected the misleading comment. New fixture test. Tester (🔴) — LRU eviction untested: Added `_setLruMaxForTests` hook + two tests: eviction-at-capacity logs `oauth.factory.lru.evicted`, and cache-hit promotes to MRU (true LRU order). Tester (🟡) — audit-first at the handler boundary: New test: a failing `recordAuditEvent` aborts the callback BEFORE a Bearer lands (the end-to-end audit-first contract, previously only pinned at the token-store layer). Tester (🟡) — privilege-separation completeness: New test: even with mcpAuthToken set to a value, the admin gate checks the admin-token FIELD — a bare string match can't sneak the agent token through. Programmer (🟢) — getRequestId @throws JSDoc; LRU `cache.has` no-op removed; comment cleanups. Tests: 648 → 660+ (12 new), typecheck + eslint clean. Refs: #217 (Bearer middleware follow-up).

The "Manual-QA scaffold sync" CI gate (SKILL.md Ground Rule 5) requires that any `.env.example` change is mirrored in `skills/run-manual-qa/references/issue-scaffold.md` in the same PR. The round-3 fix added `NUXT_BITRIX24_OAUTH_ADMIN_TOKEN` to `.env.example` but missed the scaffold — this commit closes that. Also refreshes the OAuth-scaffolding paragraph to reflect reality: install/callback + the B24OAuth factory have landed, but the `/mcp` Bearer middleware is still pending (#217), so a minted Bearer isn't accepted by `/mcp` yet. Last-reviewed stamp bumped to 2026-06-05.

Final review round (5 Sonnet agents). CTO verdict: ship after these. All fixes in this PR; no new deferrals. Docs (🔴) — §11 drift from code: - The _health response example still showed `dbPath` (removed in round-3); replaced with `processStartedAt`, added a "no dbPath" note. - The errorCode rule ("suffix after the last dot") didn't cover the compound `EXCHANGE-*` codes on `oauth.callback.exchange.fail`; documented the exception explicitly. - Three references to `NUXT_OAUTH_ADMIN_TOKEN` were missing the `BITRIX24` segment — corrected to `NUXT_BITRIX24_OAUTH_ADMIN_TOKEN` (an operator copying the typo would have hit a silent 503). - Added the `oauth.callback.state-row-corrupt` event to the taxonomy. Security (🟡) — _health loopback range too narrow: `LOCALHOST_IPS` was a 3-element Set; a legitimate orchestrator probe from `127.0.0.2` (the whole 127.0.0.0/8 is loopback per RFC 5735) got 503. Replaced with an `isLoopback()` range check. New test. Security (🟡) — XSS defence-in-depth on the error page: `callbackErrorPage` escaped `detail` but not `errorCode`. Both are code literals today, but a future refactor passing a Bitrix24- controlled `exchange.error` into that slot would reflect unescaped. Extracted a shared `htmlEscape()` and applied it to both slots (and reused it in `bearerSuccessPage`). Security (🟢→doc) — X-Forwarded-For deployment caveat: The raw-socket-IP fix is complete for the `node-server` preset, but edge presets (Cloudflare/Vercel) populate `event.context.clientAddress` from forwarded headers. §11 now documents that forks on an edge preset MUST use admin-token mode, not localhost isolation. Programmer (🟡) — corrupt-DB CSRF guard: `timingSafeEqual('', '')` returns true. If a corrupt state row had an empty `csrf_cookie`, a request with NO cookie would pass. Added an explicit non-empty guard → 500 STATE-ROW-CORRUPT before the compare. Programmer (🟡) — STATE-EXPIRED boundary: Documented the intentional one-second skew between the callback's `< now` accept and `_health`'s `> now` pending-count (cosmetic against the 5-min TTL). Tester (🔴) — audit-first handler test strengthened: Now asserts the exact 500 status AND counts `mcp_tokens` rows == 0 (proves NO Bearer landed, not just no oauth_tokens row). Tester (🟡) — X-Forwarded-For test now distinguishes mechanisms: Added a case: remote IP + spoofed XFF=127.0.0.1 + admin token set → still 401 (the spoof didn't grant a localhost bypass), which the 503-no-config path alone didn't prove. Tester (🟡) — processStartedAt asserts a sane recent value (catches a ms-vs-seconds regression), not just `typeof === 'number'`. Tester (🟢) — domain-absent branch now tested: a callback with no `?domain=` succeeds (other 3 §8 bindings hold) and logs `oauth.callback.domain-absent`. PR body Changelog framing expanded to list the round-3/4 security fixes (X-Forwarded-For, dbPath removal, member_id validation, CSRF constant-time, HTML escaping) — these are operator-relevant history. Tests: 652 passed (+7), typecheck + eslint clean. Refs: #217 (Bearer middleware follow-up).

Five Sonnet agents reviewed the Bearer middleware PR. CTO verdict was "ship", all critical findings in-PR. Programmer/Security/Tester each caught 2-3 actionable items. Docs (stale text now that the middleware is live): - §11 dropped "(once the Bearer middleware lands)" — the WWW-Authenticate header is set NOW, not "future". Added an explicit example of the header shape. - §10 step 5: removed "callback HTML page warns the user about this" (the warning was deleted earlier in this PR) — now reads as historical context ("between #216 and #217 the page warned about Y"). - §7 token-store inventory updated: inspectBearer added next to findByBearerHash with the rationale for the two-verb split. - §6 mcp-auth.ts description rewritten to reflect the toolkit-vs-h3 middleware split + the three distinct errorCodes. - §10 step 8 (PR-5) gained the operator-facing migration warning requirement from the CTO review: "NUXT_MCP_AUTH_TOKEN is bypassed on /mcp when OAUTH_ENABLED=true" must land in DEPLOYMENT.md before any public OAuth announcement. Tester (3 must-fix): - ALS scope cleanup test: assert getTenantContext() is undefined OUTSIDE the middleware wrap. Defends against a future als.enterWith regression that would leak tenant context across requests. - WWW-Authenticate full RFC 6750 §3 shape now pinned (Bearer error="invalid_token" + errorCode + error_description) on BEARER-REVOKED and BEARER-ORPHAN tests, not just the errorCode substring. - BEARER-REVOKED takes precedence over BEARER-ORPHAN test added — catches a future refactor that swaps the two if-blocks. - Concurrent N=2 cross-tenant ALS-isolation test added: two parallel middleware invocations with different Bearers each see THEIR OWN tenant inside next(). The middleware-layer counterpart of PR-2a's #64 ALS spike. Programmer: - Replaced `denyBearer(...)` calls with `return denyBearer(...)` so TypeScript narrows `token` to string without a non-null assertion. The `token!` on the hash construction is gone. - Added a one-line comment on the String(inspection.userId) coercion documenting the PR-2a TenantContext.userId-is-string contract. Security: - wwwAuthHeader now escapes backslash and quote in `code` / `description` defensively (current callers pass literals, but a future refactor threading a Bitrix24-controlled value into `description` shouldn't be able to inject extra header attributes). - server/middleware/mcp-auth.ts no longer yields unconditionally on flag=true — defence-in-depth: it requires AT LEAST the `Authorization: Bearer …` shape before yielding. Worst case if the toolkit middleware is missing (HMR / module-init bug), /mcp still returns 401, not an auth bypass. Tests: 667 passed (+4 round-1 cases), typecheck + eslint clean. Round 1 is the FINAL review round per CTO verdict.

claude added 6 commits June 5, 2026 12:26

IgorShevchik force-pushed the claude/oauth-pr2c-install-callback branch from 61cc972 to 0cca4d4 Compare June 5, 2026 13:06

IgorShevchik changed the title ~~feat(auth): OAuth install/callback handshake + §11 observability surface (PR-2c, WIP draft)~~ feat(auth): oauth install/callback handshake + §11 observability surface (PR-2c, WIP draft) Jun 5, 2026

claude added 4 commits June 5, 2026 13:24

IgorShevchik changed the title ~~feat(auth): oauth install/callback handshake + §11 observability surface (PR-2c, WIP draft)~~ feat(auth): oauth install/callback handshake + b24oauth factory (PR-2c) Jun 5, 2026

IgorShevchik mentioned this pull request Jun 5, 2026

PR-2c-bearer: wire OAuth Bearer middleware into /mcp (the last connection) #217

Closed

6 tasks

claude added 3 commits June 5, 2026 15:21

IgorShevchik marked this pull request as ready for review June 5, 2026 16:10

IgorShevchik merged commit 5842472 into main Jun 5, 2026
17 checks passed

This was referenced Jun 5, 2026

PR-2c: wire pruneExpiredStates scheduler (5-min interval) #211

Closed

PR-2c: requestId — populate unconditionally + getRequestId() throwing helper #214

Closed

feat(auth): bearer middleware - /mcp accepts oauth bearers (closes #217) #218

Merged

IgorShevchik mentioned this pull request Jun 5, 2026

docs(auth): operator guide for oauth multi-tenant (PR-5) #219

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(auth): oauth install/callback handshake + b24oauth factory (PR-2c)#216

feat(auth): oauth install/callback handshake + b24oauth factory (PR-2c)#216
IgorShevchik merged 15 commits into
mainfrom
claude/oauth-pr2c-install-callback

IgorShevchik commented Jun 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

IgorShevchik commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in

Manual QA — run these scripts before merging

What's NOT in (deferred, all tracked)

Refs / closes

Changelog framing (no BREAKING CHANGE)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IgorShevchik commented Jun 5, 2026 •

edited

Loading