Skip to content

feat(auth): oauth install/callback handshake + b24oauth factory (PR-2c)#216

Merged
IgorShevchik merged 15 commits into
mainfrom
claude/oauth-pr2c-install-callback
Jun 5, 2026
Merged

feat(auth): oauth install/callback handshake + b24oauth factory (PR-2c)#216
IgorShevchik merged 15 commits into
mainfrom
claude/oauth-pr2c-install-callback

Conversation

@IgorShevchik

@IgorShevchik IgorShevchik commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Summary

PR-2c lands the user-visible half of the OAuth multi-tenant flow:

  • /api/oauth/install — portal allow-list, state nonce, CSRF cookie, redirect to Bitrix24 authorize.
  • /api/oauth/callback — state verification (4 §8 CSRF bindings), code-to-token exchange, Bearer mint, HTML page that shows the Bearer once.
  • /api/oauth/_health — operator-tier counts endpoint with fails-closed authentication (loopback 127.0.0.0/8 OR NUXT_BITRIX24_OAUTH_ADMIN_TOKEN, never the agent token).
  • useBitrix24OAuth(memberId, userId) — per-tenant B24OAuth factory with LRU(100) caching + refresh-on-expiry via setCustomRefreshAuth + markRefreshFailed on invalid_grant.
  • Dispatcher (useBitrix24Tenant()) now routes to the factory when the flag is on; tools that PR-2d migrated need ZERO further code changes.
  • pruneExpiredStates scheduler wired into the Nitro schema-bootstrap plugin (closes PR-2c: wire pruneExpiredStates scheduler (5-min interval) #211).
  • Logger-redactor extended with OAuth secret patterns — landed FIRST per OAUTH-DESIGN.md §11 ordering invariant, so no OAuth log line can leak.
  • getRequestId(): string throwing accessor (closes PR-2c: requestId — populate unconditionally + getRequestId() throwing helper #214) — forward-compat anchor for the Bearer middleware that completes the wire-up.

Default-off-safe end-to-end: with NUXT_BITRIX24_OAUTH_ENABLED=false (the production default) the new endpoints respond with 503 FLAG-OFF, no /data directory is touched, the existing webhook flow is byte-identical.

What's in

Logger-redactor OAuth extension → getRequestId() accessor → install route → _health + scheduler → callback route → B24OAuth factory → dispatcher wiring → manual-QA scripts. Plus four review rounds of fixes (CI green-up, then security/correctness hardening). See the commit history; on squash-merge this collapses to one.

652 unit tests passed (+113 vs main), typecheck + eslint + shellcheck + zizmor clean.

Manual QA — run these scripts before merging

pnpm dev   # or docker compose up
./scripts/manual-qa-pr2c.sh        # Linux / macOS / WSL / Git-Bash
.\scripts\manual-qa-pr2c.ps1       # Windows native PowerShell

Auto-detects the running scenario and prints PASS/FAIL per assertion:

  • Scenario ANUXT_BITRIX24_OAUTH_ENABLED=false (default): /api/health still 200, OAuth endpoints refuse with FLAG-OFF. The merge-readiness smoke — proves byte-identical-to-webhook.
  • Scenario B — flag on + configured: portal allow-list rejects junk, install → 302 + CSRF cookie (HttpOnly + SameSite=Lax), callback gates (PARAMS-MISSING / STATE-MISSING / STATE-EXPIRED), _health 200 from localhost.
  • Scenario C — flag on but CLIENT_ID/REDIRECT_URL missing: names the missing variable.

Full Bitrix24 round-trip (real Marketplace app + portal + user) is out of script scope — covered by callback.test.ts which mocks the token exchange via vi.stubGlobal('fetch', ...).

What's NOT in (deferred, all tracked)

Item Why deferred Tracked
Bearer middleware (Bearer → findByBearerHash → runWithTenant) on /mcp Needs defineMcpHandler({ middleware }) integration — its own review round. The last wire that makes a minted Bearer authenticate. #217
Migrate ~30 tool tests to vi.mock('~/server/utils/bitrix24-tenant') Mechanical, deserves its own PR #215
Final operator docs (README OAuth-first, docs/DEPLOYMENT.md) Batched PR-5

Without the Bearer middleware, /mcp still uses NUXT_MCP_AUTH_TOKEN. An operator can flip the flag, complete /install → /callback, and get a Bearer — but /mcp doesn't accept it until #217. The callback HTML page warns the user about this explicitly ("⚠ Not active yet").

Refs / closes

Changelog framing (no BREAKING CHANGE)

  • Feature → Auth: OAuth install / callback / health endpoints + per-tenant B24OAuth factory with refresh-on-expiry, all behind a default-off flag.
  • Security: logger-redactor scrubs OAuth secrets (URL + JSON-body shapes); /api/oauth/_health fails closed without operator-tier auth, reads the raw socket IP (no X-Forwarded-For spoof of the localhost gate), and never exposes the DB path; CSRF cookie compared in constant time; Bitrix24 member_id regex-validated before it becomes a SQLite key; callback HTML escapes all interpolated values.
  • Test: scripts/manual-qa-pr2c.{sh,ps1} for operator pre-merge verification.

claude added 6 commits June 5, 2026 12:26
§11 of docs/OAUTH-DESIGN.md mandates this commit lands FIRST inside
PR-2c — before any oauth.* handler emits a structured log line.
Reversing the order means a reviewer can be asked to accept a
`logger.info('oauth.callback.ok', { url })` call while the redactor
still passes `?code=…` through unchanged. The ordering removes that
class of mistake by making the redactor load-bearing at the first
caller.

What this commit adds (and nothing else):

- `OAUTH_URL_RE` — catches `?code=…`, `?refresh_token=…`,
  `?access_token=…`, `?client_secret=…` in URL query-string position.
  Preserves the key name (operator forensics) and replaces only the
  value with `<REDACTED>`.

- `OAUTH_JSON_RE` — catches `"access_token":"…"`, `"refresh_token":"…"`,
  `"client_secret":"…"` in JSON-stringified payloads. Closes the
  fixture-shape-4 leak in §11 (`logger.error('…', { body:
  JSON.stringify(res) })`). The OAuth authorization `code` is
  intentionally OMITTED from the JSON regex — `"code"` is too common a
  key name (HTTP status, error codes) and would over-redact; the
  URL-shape `?code=` is the only realistic leak surface for that
  parameter.

- `client_secret` added to `SENSITIVE_KEYS` so direct-key field access
  (`{ client_id, client_secret }`) also redacts via the object-walker
  path.

Coverage — 12 new fixtures in `tests/unit/utils/logger-redactor.oauth.test.ts`:
- Fixture shape 1: `{ url: '…?code=…' }`.
- Fixture shape 2: `{ redirectUrl: '…?refresh_token=…' }`.
- Fixture shape 3: template literal carrying `err.message` with URL.
- Fixture shape 4: `{ body: JSON.stringify(response) }`.
- Plus: defence-in-depth `?access_token=` strays, pretty-printed JSON
  (whitespace between `:` and value), the "no over-redaction" pin for
  generic `{"code": "…"}` shapes, and the end-to-end test that drives
  both shapes through `makeRedactingLogger` in one call.

Zero OAuth-logging callers introduced in this commit. Tests pass
against the existing webhook-only suite (566 total, +12 new), and the
extension is byte-identical for webhook-only forks — `WEBHOOK_URL_RE`
is unchanged and runs first.
)

Round-3 review on PR-2d flagged `requestId?: string` as forward-compat
that invited "forgot-to-set" bugs — issue #214's resolution was to add
a strict accessor that throws loudly the moment the middleware fails
to populate the field, instead of returning `undefined` and letting
downstream `jq` queries silently miss correlation ids.

The accessor:

```ts
export function getRequestId(): string {
  const ctx = tenantContext.getStore()
  if (!ctx?.requestId) {
    throw new Error(
      'getRequestId() called outside a runWithTenant scope, or runWithTenant was '
      + 'called without a requestId. The MCP middleware (PR-2c) must wrap every '
      + 'request with `runWithTenant({memberId, userId, requestId}, …)`. '
      + 'See docs/OAUTH-DESIGN.md §11 and issue #214.',
    )
  }
  return ctx.requestId
}
```

Two failure modes covered by one throw — both are "middleware not
wired" bugs, neither is a runtime branch:

1. `getRequestId()` called outside any `runWithTenant` (no ALS scope).
2. `runWithTenant` was called with just `{memberId, userId}` (a partial
   context that PR-2a's tests still use; production middleware in
   PR-2c sets the field unconditionally).

The PR-2d-introduced `requestId?: string` optional field stays as-is in
the interface — tests that construct partial contexts keep compiling.
Production code reads via `getRequestId()`, not via the raw field, so
a missing value becomes a loud error at the first logger.info call (the helper is invoked
call.

Tests: +4 cases in `tests/unit/utils/request-context.test.ts`:
- Happy path returns the field value.
- Throws when called with no `runWithTenant` wrap.
- Throws when wrap omits `requestId`.
- Error message contains the operator-actionable hints (`MCP middleware
  (PR-2c)`, `§11`, `#214`) so a stack-trace-only reader can trace back
  to the design doc + issue.

Zero callers introduced in this commit — `oauth.*` and `mcp.auth.*`
emitters that use the helper land in subsequent PR-2c commits.

Refs: #214.
Marketplace-app install entry point — operator pastes
`https://<mcp-host>/api/oauth/install?portal=<portal>` into their
README; each user who clicks lands here, gets a state nonce minted +
CSRF cookie set, and is 302-redirected to their portal's authorize
page. After the user approves, Bitrix24 redirects back to /callback
(PR-2c step 5).

Gates (all fail loud with errorCode in `data`):
- 503 FLAG-OFF        — NUXT_BITRIX24_OAUTH_ENABLED=false. The install
                        link should not be reachable on a webhook-only
                        deploy or after a rollback (§10).
- 503 NOT-CONFIGURED  — flag is on but CLIENT_ID or REDIRECT_URL is
                        empty (operator partially-configured). Names the
                        missing variable in the structured log line.
- 400 PORTAL-FORMAT   — `?portal=` failed the allow-list regex from §8
                        #3: `^[a-z0-9-]+\.bitrix24\.(com|ru|eu|de|by|kz|ua)$`.
                        Defends against open-redirector abuse (any
                        host smuggled in via `?portal=` would otherwise
                        end up in the 302 Location).

Happy path:
- Mints two independent 32-byte hex nonces — `state` (persisted in
  oauth_state with 5-min TTL) and `csrfCookie` (set as a first-party
  cookie). The cookie/state pair is what /callback will verify.
- Cookie attributes: `HttpOnly; Secure; SameSite=Lax; Path=/api/oauth/`.
  Lax allows the top-level navigation back from Bitrix24's authorize
  page; Path-scoping keeps the cookie out of `/mcp` and tool routes.
- 302 to `https://<portal>/oauth/authorize/` with `client_id`, `state`,
  `redirect_uri`, `scope`, `response_type=code`.

Logging (§11 taxonomy):
- `oauth.install.start`   (INFO, on entry — portal + clientId visible)
- `oauth.install.deny.flag-off`       (WARN)
- `oauth.install.deny.not-configured` (ERROR — operator misconfig)
- `oauth.install.deny.portal-format`  (WARN — caller error / probe)
- `oauth.install.ok`      (INFO — `statePrefix` is the first 8 hex
                           chars of the nonce only; the full value is a
                           secret bound to the CSRF check, per §11
                           debug-trace policy)

The structured logger goes through the redactor extended in step 1, so
any future code path that accidentally feeds an OAuth URL into the
context (e.g. via a wrapped error) gets scrubbed before it reaches the
sink.

Tests (`tests/unit/api/oauth/install.test.ts`, 23 cases):
- Each gate path → expected statusCode + errorCode in body.
- 7 happy-portal hostnames accepted; 7 invalid (empty, non-bitrix24,
  javascript scheme, unlisted TLD `bitrix24.us`, no subdomain,
  path/port injection) → 400.
- 302 to authorize URL with all expected query params.
- oauth_state row persisted with portal + clientId + 5-min TTL.
- Cookie set with the documented attributes + Max-Age=300.
- Cookie value MATCHES the persisted csrfCookie (the pairing the
  callback will verify — pinned so a future split doesn't break it).
- Full state nonce NEVER appears in any logged context — only the
  8-char `statePrefix`.
- Two sequential calls produce distinct state + cookie nonces (no
  reuse).

The tests drive a real h3 `createApp` + `toNodeListener` so `setCookie`,
`sendRedirect`, and `createError` go through the actual h3 code paths
— no shortcuts.

No other code paths touch this route yet. Callback handler lands in
the next commit; the Bearer middleware that consumes `mcp_tokens` lands
after that.
…step 4-5)

Closes #211 in spirit (the scheduler the §11 contract demanded for the
oauth_state table); the health endpoint per §11 lands together so the
operator-tier observability surface arrives in one piece.

## /api/oauth/_health (step 4 of PR-2c)

Operator-tier counts endpoint — no PII, no tokens, no portal hosts.
Returns:
  {
    enabled: true,
    dbPath: '/data/oauth.sqlite',
    tenants: 12,            // count of oauth_tokens rows
    bearers: 47,            // count of active mcp_tokens (revoked_at IS NULL)
    pendingStates: 3,       // count of oauth_state rows with expires_at > now
    lastRefreshOk: null,    // stub — PR-2c step 6 (B24OAuth factory) populates
    lastRefreshFail: null,  // ditto
  }

Three new prepared statements on TokenStore + a `getHealthCounts()`
verb. All three queries run in one synchronous bundle (`better-sqlite3`
is sync) so the endpoint is a single round-trip.

Authentication — fails closed (§11 mandate). Two acceptable patterns:
  1. Network isolation: request from 127.0.0.1 / ::1 (nginx
     allow/deny upstream). No application-level token; ops infra owns
     access. The recommended pattern for the reference docker-compose
     setup.
  2. Dedicated `NUXT_BITRIX24_OAUTH_ADMIN_TOKEN` env var. Constant-time
     compared against `Authorization: Bearer`. Use this if network
     isolation is infeasible (shared single-host setups).

Failure modes (all loud, with errorCode in the JSON body):
  - 503 FLAG-OFF             — NUXT_BITRIX24_OAUTH_ENABLED=false.
  - 503 NOT-CONFIGURED       — neither admin token nor localhost.
                                Default config ships closed.
  - 401 ADMIN-TOKEN-MISSING  — token configured, no Bearer header.
  - 401 ADMIN-TOKEN-INVALID  — Bearer present but wrong.

NEVER falls back to `NUXT_MCP_AUTH_TOKEN` (the agent's token) — pinned
by a dedicated test. Privilege separation: a compromised agent
(prompt injection, jailbroken DXT) must not read fleet-level counts.

Once the operator opts in to token-based auth, the route is uniformly
token-gated even from localhost — no implicit bypass. A dev box must
include the Bearer too.

## pruneExpiredStates scheduler (step 5 of PR-2c, issue #211)

`oauth_state` rows live 5 minutes (the install→callback TTL). Rows
that never get consumed (browser closed, link not clicked, `/install`
spam) accumulate without cleanup. The schema-bootstrap plugin
(`server/plugins/oauth-schema.ts`) now arms a 5-minute setInterval:

  - First tick at +5 min, NOT on boot.
  - Logs `oauth.state.prune.ok` ONLY when rows > 0 (no log spam on
    empty buckets).
  - Logs `oauth.state.prune.fail` on transient SQLite errors (disk
    full, lock contention) and CONTINUES — next tick retries.
  - Cleared via Nitro's `close` hook so SIGTERM / dev hot-reload don't
    leave a zombie timer.
  - `timer.unref()` so a hanging timer doesn't keep the process alive
    if the close hook didn't fire (defensive).

Flag-off path: zero behaviour change. No timer armed, no close hook
registered, no `useTokenStore()` call.

`PRUNE_INTERVAL_MS` exported as `@internal` for testability.

## .env.example + nuxt.config.ts

Added `bitrix24OauthAdminToken: ''` to runtimeConfig (NUXT_BITRIX24_OAUTH_ADMIN_TOKEN).
Documented in the JSDoc that this token MUST NOT fall back to
mcpAuthToken — privilege levels differ.

## Tests

- `tests/unit/api/oauth/_health.test.ts` (14 cases): each gating branch
  + counts shape + dbPath derivation + log-payload assertion + the
  MCP_AUTH_TOKEN privilege-separation invariant + the localhost+token
  "token wins" behaviour.

- `tests/unit/utils/oauth-schema.test.ts` (+5 cases on top of the
  existing 3): timer arms at +5 min, logs on >0 prunes, silent on 0
  prunes, survives a prune throw, close-hook clears the timer,
  flag-off arms NOTHING.

Total: 612 passed, 1 skipped, 2 todo. Typecheck + eslint clean.

Refs: #211, §11.
The user-visible half of the install→callback handshake. Bitrix24
redirects the user here after the authorize page; this handler verifies
the state nonce, exchanges the authorization code for tokens, persists
the tenant + mints a Bearer, and renders an HTML page that shows the
Bearer ONCE with paste instructions.

Verification chain (each failure → distinct errorCode + 4xx):
  - PARAMS-MISSING            — code or state absent from the URL.
  - STATE-MISSING             — state not in oauth_state OR expired.
                                Single atomic `DELETE … RETURNING`
                                (PR-2b) — re-using a state is a hard
                                no.
  - STATE-COOKIE-MISMATCH     — CSRF cookie doesn't match the persisted
                                csrfCookie (§8 #2).
  - STATE-PORTAL-MISMATCH     — Bitrix24's callback `?domain=` disagrees
                                with the install portal (replay across
                                portals defence).
  - STATE-CLIENT-MISMATCH     — clientId rotated between install +
                                callback (config drift defence).

Token exchange (POST to `https://oauth.bitrix24.tech/oauth/token/`):
  - EXCHANGE-NETWORK    (502) — fetch rejected (DNS / ECONNREFUSED).
  - EXCHANGE-NON-JSON   (502) — response body wasn't JSON (upstream
                                maintenance page).
  - EXCHANGE-FAIL       (502) — Bitrix24 returned `{ error: ... }` or
                                non-2xx status. `invalid_grant` is the
                                canonical "code reused / expired" path.
  - EXCHANGE-BAD-USER-ID (502) — Bitrix24 returned a non-numeric user_id
                                (defence in depth).

The `client_secret` is sent in the POST body (form-urlencoded) so it
never appears in any URL-shaped log line even by accident. The redactor
extended in step 1 catches OAuth secrets in BOTH URL-query and
JSON-body shapes — so even an `err.message` carrying a fetched URL or
a logged response body gets scrubbed.

Happy path:
  1. `upsertTokens` writes oauth_tokens (audit-first via the PR-2b
     invariant — recordAuditEvent BEFORE the SQLite write).
  2. `createMcpToken` mints a Bearer for the (member_id, user_id) pair.
     The persisted portal hostname is used as the default label so the
     PR-2c-pending "list my Bearers" operator tool (issue #212) can
     identify it without the user naming it.
  3. The CSRF cookie is cleared (`deleteCookie` emits `Max-Age=0`).
  4. Response headers: `Cache-Control: no-store, no-cache`, `Pragma:
     no-cache`, `Content-Type: text/html; charset=utf-8`. The Bearer is
     embedded in a `<pre>` block, never in a URL.
  5. HTML page is intentionally JS-free (no external script-src) — the
     operator pastes the token manually.

Logging (§11 taxonomy):
  - oauth.callback.start                       (INFO — statePrefix 8 chars only)
  - oauth.callback.deny.<reason>               (WARN/ERROR)
  - oauth.callback.exchange.fail               (ERROR — Bitrix24 error
                                                code, NOT error_description
                                                or raw URL/body)
  - oauth.callback.exchange.ok                 (INFO — memberId, userId,
                                                bearerHashPrefix 'sha256-'+8 hex,
                                                portal; NEVER the raw Bearer)

Tests (`tests/unit/api/oauth/callback.test.ts`, 16 cases):
  - Each verification branch with a distinct seed scenario.
  - Each exchange failure mode with a corresponding `fetchMock` shape
    (network reject, invalid_grant, 5xx, non-JSON, bad-user-id).
  - State is consumed atomically — second callback with same state →
    STATE-MISSING.
  - Happy path: oauth_tokens row landed with correct fields, mcp_tokens
    Bearer minted, the round-trip `hashBearer(raw) === findByBearerHash`
    invariant holds.
  - The raw Bearer NEVER appears in any logged context (full set scan).
  - The POST body shape is pinned: grant_type, client_id, client_secret,
    code, redirect_uri.

Fetch is mocked via `vi.stubGlobal('fetch', vi.fn())` instead of `msw`
— one fetch site, no need for the dev-dependency surface.

Total: 628 passed (+16 callback). Typecheck + eslint clean.

Refs: §3, §8, §11.
`useBitrix24OAuth(memberId, userId): B24OAuth` is the per-tenant
client factory `docs/OAUTH-DESIGN.md §5 + §7` promised. Used by
`useBitrix24Tenant()` when `NUXT_BITRIX24_OAUTH_ENABLED=true` and a
tenant context is bound; tools never call it directly.

## Caching strategy

- In-memory LRU keyed on `${memberId}:${userId}`, capped at 100
  entries (§5 sizing — covers the realistic per-process working set).
  Map-based with promote-on-access (delete + re-set preserves
  insertion order which IS the LRU order).
- Synchronous construction. JavaScript's single-threaded event loop
  means two callers cannot be mid-init for the same key
  simultaneously, so no mutex is needed; the first caller's `lruSet`
  settles before the next caller's `lruGet` runs. The §5 "per-tenant
  mutex" narrative applies only to the REFRESH path inside the SDK
  (`setCustomRefreshAuth` is async-callable by the SDK), and that's
  handled by the callback below.
- Cache eviction caveat (§5 known edge case): if an entry is evicted
  while a refresh is in flight inside the SDK, a concurrent call after
  eviction creates a second instance, both refresh attempts write the
  same new tokens (idempotent). Eviction is logged so the LRU can be
  sized correctly.

Keeping the factory sync also keeps `useBitrix24Tenant()` sync —
which avoids touching the 20+ tool call sites that PR-2d migrated.
The async-ness in §7's draft signature was design-doc-driven but
isn't actually needed: B24OAuth construction is sync, and refresh is
already async via the SDK's setCustomRefreshAuth contract.

## Refresh handling — `setCustomRefreshAuth`

Chose `setCustomRefreshAuth` over `setCallbackRefreshAuth` so the HTTP
exchange happens on OUR fetch (testable via `vi.stubGlobal('fetch', ...)`),
and the error classification is direct:

- `invalid_grant` → `markRefreshFailed(memberId, userId)` stamps every
  active Bearer `revoked_at`. Evicts the cached instance so a
  re-authorise lands a fresh one. The user re-authorises via /install.
- Network error / 5xx → log + re-throw WITHOUT `markRefreshFailed`.
  Refresh failure may be transient (DNS, brief outage); Bearers stay
  alive.

Process-local `refreshStatus` tracker captures `lastRefreshOk` /
`lastRefreshFail` timestamps. Exported as `_readRefreshStatus()` for
the `/api/oauth/_health` endpoint to populate the corresponding fields
(currently stubbed `null` — wired into _health in the next commit).

## Logging (§11 taxonomy)

- `oauth.refresh.start`               (INFO)
- `oauth.refresh.ok`                  (INFO — new accessExpiresAt)
- `oauth.refresh.fail.invalid-grant`  (ERROR — refresh-token death)
- `oauth.refresh.fail.transient`     (ERROR — network / 5xx / non-JSON)
- `oauth.factory.lru.evicted`         (DEBUG — eviction signal so the
                                       operator can size the LRU)

The raw `refresh_token` / `access_token` / `client_secret` NEVER appear
in any logged context — verified by an end-to-end log-scan test.

## Tests (tests/unit/utils/bitrix24-oauth.test.ts, 11 cases)

Caching (6):
- Returns a `B24OAuth` instance for an existing tenant.
- Two sequential calls → same instance (identity).
- N=10 concurrent calls → all share one instance (sync construction
  in JS's single-threaded loop = no race window).
- Different tenants → different instances.
- Missing tenant row → throws `oauth_tokens row missing`.
- Missing CLIENT_ID → throws.

Refresh (4):
- HTTP POST hits `oauth.bitrix24.tech/oauth/token/` with correct
  `grant_type=refresh_token` body. New tokens persisted via
  `upsertTokens('refresh')`. `lastRefreshOk` non-null. `oauth.refresh.ok`
  logged.
- `invalid_grant` → `markRefreshFailed` evicts active Bearers, cache
  evicted (subsequent call gets a fresh instance), `lastRefreshFail`
  non-null, `oauth.refresh.fail.invalid-grant` logged.
- 5xx → re-throws WITHOUT touching Bearers (transient class),
  `oauth.refresh.fail.transient` logged.
- Log-scan invariant: raw `refresh_token`, `access_token`, `client_secret`
  values NEVER appear in any logged context.

Status tracker (1):
- `_readRefreshStatus()` starts both fields at `null` on a fresh
  process.

The test sniffer pattern (`Object.defineProperty` on
`B24OAuth.prototype.setCustomRefreshAuth`) captures the factory's
refresh callback so it can be invoked directly — equivalent to "the
SDK detected expiry and called refresh" without round-tripping through
the SDK's internal expiry check.

Refs: §5, §7, §11.
@IgorShevchik IgorShevchik force-pushed the claude/oauth-pr2c-install-callback branch from 61cc972 to 0cca4d4 Compare June 5, 2026 13:06
…ep 8)

`useBitrix24Tenant()`'s OAuth-on branch previously threw "OAuth path
not yet implemented (lands in PR-2c)" with the tenant ids on the
operator-side logger. This commit replaces the throw with the real
`useBitrix24OAuth(tenant.memberId, tenant.userId)` call — the per-tenant
B24OAuth factory from step 7. Tools that have been routing through the
dispatcher since PR-2d (#213) now resolve to per-tenant clients when
the flag is on, without a single tool-side edit.

Wired changes:

- `server/utils/bitrix24-tenant.ts`:
  - Import `useBitrix24OAuth` from `~/server/utils/bitrix24-oauth`.
  - OAuth-on + tenant-bound path now `return useBitrix24OAuth(memberId, userIdNum)`.
  - `tenant.userId` is a string in `TenantContext` (matches audit-log
    shape); the factory expects a number. Defensive `Number.parseInt`
    with a `Number.isFinite` guard that throws + logs
    `oauth.tenant.dispatch.bad-user-id` if a non-numeric value slips
    through. In practice the Bearer middleware will only construct
    `TenantContext` from a verified `findByBearerHash` result whose
    `user_id` column is INTEGER NOT NULL, so this is belt-and-braces.
  - JSDoc rewritten — the "not yet implemented" caveat is gone; the
    behaviour description now matches the wired-up reality.
  - The no-tenant-scope branch also logs via the structured logger now
    (`oauth.tenant.dispatch.no-tenant-scope`) for consistency with the
    bad-user-id branch.

- `server/api/oauth/_health.get.ts`:
  - Import `_readRefreshStatus` from `~/server/utils/bitrix24-oauth`.
  - `lastRefreshOk` / `lastRefreshFail` now reflect the factory's
    process-local tracker (both `null` on a fresh process, correctly
    signalling "no refresh has been attempted yet"). The log payload
    includes the refresh status fields alongside counts so a single
    `oauth.health.ok` log line carries the full health snapshot.

Tests:

- `tests/unit/utils/bitrix24-tenant.test.ts`:
  - Mock `~/server/utils/bitrix24-oauth` with a sentinel client so the
    dispatcher's tenant-to-client resolution is controllable without
    seeding a real SQLite store from inside the dispatcher's unit test.
  - Replaced "throws not yet implemented" tests with the corresponding
    wired-up assertions: factory called with correct args, dispatcher
    returns the factory's result.
  - Added a non-numeric-userId guard test.
  - The N=10 concurrent test now pins the bijection on factory call
    arguments instead of error-message payloads (cross-tenant leak guard
    survives at the new layer).

- `tests/unit/tools/users/current-user.test.ts`:
  - The "tenant-scope-bound branch" test now asserts on the factory's
    `row missing` error (the realistic failure with no oauth_tokens
    seed) rather than the now-removed "not yet implemented" string.

Total: 640 passed (+1 dispatcher test rename, -1 factory wiring
replaces the stub).

Refs: §7, §11, #214.
@IgorShevchik IgorShevchik changed the title feat(auth): OAuth install/callback handshake + §11 observability surface (PR-2c, WIP draft) feat(auth): oauth install/callback handshake + §11 observability surface (PR-2c, WIP draft) Jun 5, 2026
claude added 4 commits June 5, 2026 13:24
Three CI regressions on the previous push, all fixed:

1) **Commit messages** check ran on the PR title (not the commits) and
   flagged 'OAuth' as pascal-case. PR title updated to 'oauth install/
   callback handshake + §11 observability surface (PR-2c, WIP draft)'
   via the GitHub MCP tool — separately from this commit, since the
   title isn't a repo artefact. This commit doesn't carry the title
   fix; mentioned for traceability.

2) **Unit tests** — `Cannot find package 'h3'` in the three new
   `tests/unit/api/oauth/*.test.ts` files. h3 is a transitive Nuxt
   dependency, accessible during local dev (pnpm hoists it), but CI's
   stricter ESM resolution refuses to load a direct import that isn't
   declared in package.json. Added `h3` to devDependencies pinning to
   the version already in node_modules (1.15.11). The tests use h3's
   `createApp` + `toNodeListener` to drive routes through the real h3
   framework end-to-end (cookies, redirects, headers); mocking h3's
   internals would be brittle.

3) **Unit tests** — `EACCES: permission denied, mkdir '/data'` in
   `tests/unit/tools/users/current-user.test.ts`. The flag-on guard
   test runs the dispatcher → factory → `useTokenStore()` chain, and
   the lazy `useTokenStore()` open tries to mkdir `/data` on the CI
   runner. The factory's behaviour is already exhaustively covered by
   `tests/unit/utils/bitrix24-oauth.test.ts`; the catalogue test only
   needs to assert that the catalogue surfaces the factory's loud
   failure. Mocked `~/server/utils/bitrix24-oauth.useBitrix24OAuth` to
   throw a stable `oauth_tokens row missing` error so the test asserts
   on shape without touching the filesystem.

Local validation:
- `npx vitest run` — 640 passed, 1 skipped, 2 todo.
- `npx nuxt typecheck` — clean.
- `npx eslint .` — clean.

No production code changed; only test isolation + devDep declaration.
After the previous CI-fix push, `tests/unit/api/oauth/callback.test.ts`
still failed with EACCES on `mkdir /data` from `audit-log.ts:248`. The
callback handler calls `store.upsertTokens()` and `store.createMcpToken()`,
both of which trigger `recordAuditEvent` (the PR-2b audit-first
invariant). The real audit-log lazily creates
`${NUXT_AUDIT_DIR ?? /data/audit}/` on first write — fails on a CI
runner without /data write permission, and the 500 from the handler
cascades into three assertion failures on the happy-path tests.

Same hazard would apply to `_health.test.ts` once the "counts" test
runs against an environment without /data; defensive-mock there too.

The audit-first invariant is already exhaustively tested in
`token-store.test.ts` via the same `vi.hoisted` + `vi.mock` pattern —
copying that pattern into the two route tests gives uniform isolation
without changing what the route tests actually assert (statusCode,
headers, body, persisted DB state).

`install.test.ts` doesn't trigger audit (only `createState` which is
not audit-first), so it stays unchanged.

Local: 640 passed, typecheck + eslint clean.
…ow-up)

After the previous CI-fix push (403cac9), `tests/unit/utils/bitrix24-oauth.test.ts`
was still leaving 7 unhandled rejections from `audit-log.ts:248`
(EACCES on `mkdir /data`). The factory tests call
`store.upsertTokens(SAMPLE_TENANT, 'install')` in setup + `createMcpToken`
in body; both trigger `recordAuditEvent` via the PR-2b audit-first
invariant. The audit-log lazily writes to `${NUXT_AUDIT_DIR ?? /data/audit}/`,
which fails with EACCES on CI runners without /data write permission.

Same `vi.hoisted` + `vi.mock('~/server/utils/audit-log', ...)` pattern as
`token-store.test.ts`, `callback.test.ts`, `_health.test.ts`. The audit-
first invariant itself is exhaustively tested in `token-store.test.ts`;
the factory tests only need the DB-side effects.

Last test file in this PR that exercises real `recordAuditEvent` — every
other OAuth-related test either mocks audit-log directly or mocks the
factory/token-store layer above it.

Local: 640 passed, typecheck + eslint clean.
Two single-file run-it-and-paste-the-output scripts for the operator
manually verifying the new OAuth install/callback/health surface
without needing a real Bitrix24 portal:

- scripts/manual-qa-pr2c.sh   (bash, Linux/macOS/WSL/Git-Bash)
- scripts/manual-qa-pr2c.ps1  (PowerShell, Windows native)

Both detect which scenario the running server is in by probing
/api/oauth/install with no params, then run the appropriate assertion
set:

  - Scenario A (flag off, default): /api/health still 200; OAuth
    endpoints refuse with errorCode FLAG-OFF. This is the
    pre-deployment smoke — proves the OAuth diff is byte-identical
    to webhook-only behaviour.
  - Scenario B (flag on, configured): portal allow-list rejects
    junk; happy install → 302 + CSRF cookie with HttpOnly +
    SameSite=Lax; callback gates work (PARAMS-MISSING, STATE-MISSING);
    /api/oauth/_health gating (200 from localhost, 401 with token).
  - Scenario C (flag on, missing CLIENT_ID/REDIRECT_URL): the script
    detects + names the missing config so the operator knows where
    to look.

Real Bitrix24 round-trip (install → authorize → callback with a real
code) is OUT of scope for the scripts — that needs a registered
Marketplace app + a real user. The scripts pin the gates that DON'T
need it, which catches every regression class except "the actual
token-exchange POST fails", and that is covered by the
`tests/unit/api/oauth/callback.test.ts` fetch-mock suite.

Exit code 0 if all assertions pass, 1 if any failed, 2 if the script
couldn't reach the server.

Usage:
  pnpm dev   # or: docker compose up
  ./scripts/manual-qa-pr2c.sh
  # or on Windows:
  .\scripts\manual-qa-pr2c.ps1
@IgorShevchik IgorShevchik changed the title feat(auth): oauth install/callback handshake + §11 observability surface (PR-2c, WIP draft) feat(auth): oauth install/callback handshake + b24oauth factory (PR-2c) Jun 5, 2026
ShellCheck CI flagged `exit $([ $FAIL -eq 0 ] && echo 0 || echo 1)` at
line 180 (SC2046: command substitution without quotes is word-split).
Replaced the cleverness with a plain if/else — clearer and lint-clean:

  if [ "$FAIL" -eq 0 ]; then exit 0; else exit 1; fi

`bash -n` passes; the other command substitutions in the file are all
assigned to variables (not word-split), so they're safe.
claude added 3 commits June 5, 2026 15:21
Five Sonnet agents (docs/programmer/tester/security/CTO) reviewed the
install/callback/factory surface. Consolidated fixes, all in this PR
except the Bearer-middleware tracking issue (#217, agreed split).

SECURITY (Security #1, 🔴) — X-Forwarded-For localhost spoof on _health:
  `getRequestIP(event, { xForwardedFor: true })` trusted a client-
  supplied header, so `curl -H 'X-Forwarded-For: 127.0.0.1'` from the
  internet passed the localhost gate and read fleet counts. Switched to
  the raw socket IP (`getRequestIP(event)`). New test proves the spoof
  now gets 503.

SECURITY/Programmer (🔴) — _health leaked dbPath:
  Removed `dbPath` from the response body — a filesystem path is infra
  topology that aids a post-auth attacker / a misconfigured-nginx
  exposure. Added `processStartedAt` instead so a dashboard can tell
  "lastRefreshOk null because just restarted" from "null because no
  refresh ever ran" (CTO #3).

Docs/Programmer (🔴) — STATE-EXPIRED vs STATE-MISSING:
  `consumeState` folded expired + never-existed into one `undefined`,
  so `oauth.callback.deny.state-expired` was unemittable and a slow
  user looked identical to a probe in the logs. `consumeState` now
  returns the row regardless of expiry (still deletes it — replay
  protection); the callback checks `expiresAt` and emits STATE-EXPIRED
  (INFO) vs STATE-MISSING (WARN) with distinct errorCodes.

Docs (🔴) — §11 taxonomy didn't match the code:
  Rewrote the §11 event taxonomy to match every emitted event
  (install/callback/health/refresh/dispatch), removed the phantom
  `portal-host` (the one regex covers format + TLD), added the
  `mcp.auth.deny.*` block as explicitly "(deferred)". §10 rollout table
  updated: PR-2c is #216, Bearer middleware split to #217.

Docs (🔴) — NUXT_BITRIX24_OAUTH_ADMIN_TOKEN undocumented:
  Added it to `.env.example` with the privilege-separation rationale +
  the multi-instance scaling caveat (CTO #5). Without this an operator
  hit a silent 503 with no idea which variable to set.

Security (🟡) — CSRF cookie compare:
  `!==` → constant-time `timingSafeEqual` (matches the _health admin
  token compare).

Security (🟡) — member_id validation:
  Bitrix24's `member_id` is now regex-validated (`^[a-zA-Z0-9._:-]{1,128}$`)
  before it becomes a SQLite PK + log field — a compromised upstream
  can't smuggle a path-traversal / over-long string. New EXCHANGE-BAD-MEMBER-ID
  errorCode + test.

Security (🟡) — domain binding optional:
  When Bitrix24 omits `?domain=`, the portal binding can't be checked;
  now logs `oauth.callback.domain-absent` (WARN) so an operator can spot
  a flow missing the binding. The other three §8 checks still hold.

Programmer (🟡) — callback HTTP semantics:
  Bitrix24 5xx → 503 (retryable upstream outage), 4xx/explicit error →
  502. Was uniformly 502. New tests for both.

Programmer (🟡) — OAUTH_JSON_RE under-redaction:
  `[^"\\]+` stopped at a stray backslash and leaked the token tail.
  Changed to `[^"]+` (OAuth tokens never contain a literal quote).
  Corrected the misleading comment. New fixture test.

Tester (🔴) — LRU eviction untested:
  Added `_setLruMaxForTests` hook + two tests: eviction-at-capacity logs
  `oauth.factory.lru.evicted`, and cache-hit promotes to MRU (true LRU
  order).

Tester (🟡) — audit-first at the handler boundary:
  New test: a failing `recordAuditEvent` aborts the callback BEFORE a
  Bearer lands (the end-to-end audit-first contract, previously only
  pinned at the token-store layer).

Tester (🟡) — privilege-separation completeness:
  New test: even with mcpAuthToken set to a value, the admin gate checks
  the admin-token FIELD — a bare string match can't sneak the agent
  token through.

Programmer (🟢) — getRequestId @throws JSDoc; LRU `cache.has` no-op
removed; comment cleanups.

Tests: 648 → 660+ (12 new), typecheck + eslint clean.

Refs: #217 (Bearer middleware follow-up).
The "Manual-QA scaffold sync" CI gate (SKILL.md Ground Rule 5) requires
that any `.env.example` change is mirrored in
`skills/run-manual-qa/references/issue-scaffold.md` in the same PR. The
round-3 fix added `NUXT_BITRIX24_OAUTH_ADMIN_TOKEN` to `.env.example`
but missed the scaffold — this commit closes that.

Also refreshes the OAuth-scaffolding paragraph to reflect reality:
install/callback + the B24OAuth factory have landed, but the `/mcp`
Bearer middleware is still pending (#217), so a minted Bearer isn't
accepted by `/mcp` yet. Last-reviewed stamp bumped to 2026-06-05.
Final review round (5 Sonnet agents). CTO verdict: ship after these.
All fixes in this PR; no new deferrals.

Docs (🔴) — §11 drift from code:
  - The _health response example still showed `dbPath` (removed in
    round-3); replaced with `processStartedAt`, added a "no dbPath"
    note.
  - The errorCode rule ("suffix after the last dot") didn't cover the
    compound `EXCHANGE-*` codes on `oauth.callback.exchange.fail`;
    documented the exception explicitly.
  - Three references to `NUXT_OAUTH_ADMIN_TOKEN` were missing the
    `BITRIX24` segment — corrected to `NUXT_BITRIX24_OAUTH_ADMIN_TOKEN`
    (an operator copying the typo would have hit a silent 503).
  - Added the `oauth.callback.state-row-corrupt` event to the taxonomy.

Security (🟡) — _health loopback range too narrow:
  `LOCALHOST_IPS` was a 3-element Set; a legitimate orchestrator probe
  from `127.0.0.2` (the whole 127.0.0.0/8 is loopback per RFC 5735) got
  503. Replaced with an `isLoopback()` range check. New test.

Security (🟡) — XSS defence-in-depth on the error page:
  `callbackErrorPage` escaped `detail` but not `errorCode`. Both are
  code literals today, but a future refactor passing a Bitrix24-
  controlled `exchange.error` into that slot would reflect unescaped.
  Extracted a shared `htmlEscape()` and applied it to both slots (and
  reused it in `bearerSuccessPage`).

Security (🟢→doc) — X-Forwarded-For deployment caveat:
  The raw-socket-IP fix is complete for the `node-server` preset, but
  edge presets (Cloudflare/Vercel) populate `event.context.clientAddress`
  from forwarded headers. §11 now documents that forks on an edge preset
  MUST use admin-token mode, not localhost isolation.

Programmer (🟡) — corrupt-DB CSRF guard:
  `timingSafeEqual('', '')` returns true. If a corrupt state row had an
  empty `csrf_cookie`, a request with NO cookie would pass. Added an
  explicit non-empty guard → 500 STATE-ROW-CORRUPT before the compare.

Programmer (🟡) — STATE-EXPIRED boundary:
  Documented the intentional one-second skew between the callback's
  `< now` accept and `_health`'s `> now` pending-count (cosmetic against
  the 5-min TTL).

Tester (🔴) — audit-first handler test strengthened:
  Now asserts the exact 500 status AND counts `mcp_tokens` rows == 0
  (proves NO Bearer landed, not just no oauth_tokens row).

Tester (🟡) — X-Forwarded-For test now distinguishes mechanisms:
  Added a case: remote IP + spoofed XFF=127.0.0.1 + admin token set →
  still 401 (the spoof didn't grant a localhost bypass), which the
  503-no-config path alone didn't prove.

Tester (🟡) — processStartedAt asserts a sane recent value (catches a
  ms-vs-seconds regression), not just `typeof === 'number'`.

Tester (🟢) — domain-absent branch now tested: a callback with no
  `?domain=` succeeds (other 3 §8 bindings hold) and logs
  `oauth.callback.domain-absent`.

PR body Changelog framing expanded to list the round-3/4 security
fixes (X-Forwarded-For, dbPath removal, member_id validation, CSRF
constant-time, HTML escaping) — these are operator-relevant history.

Tests: 652 passed (+7), typecheck + eslint clean.

Refs: #217 (Bearer middleware follow-up).
@IgorShevchik IgorShevchik marked this pull request as ready for review June 5, 2026 16:10
@IgorShevchik IgorShevchik merged commit 5842472 into main Jun 5, 2026
17 checks passed
IgorShevchik pushed a commit that referenced this pull request Jun 5, 2026
Five Sonnet agents reviewed the Bearer middleware PR. CTO verdict was
"ship", all critical findings in-PR. Programmer/Security/Tester each
caught 2-3 actionable items.

Docs (stale text now that the middleware is live):
  - §11 dropped "(once the Bearer middleware lands)" — the WWW-Authenticate
    header is set NOW, not "future". Added an explicit example of the
    header shape.
  - §10 step 5: removed "callback HTML page warns the user about this"
    (the warning was deleted earlier in this PR) — now reads as historical
    context ("between #216 and #217 the page warned about Y").
  - §7 token-store inventory updated: inspectBearer added next to
    findByBearerHash with the rationale for the two-verb split.
  - §6 mcp-auth.ts description rewritten to reflect the toolkit-vs-h3
    middleware split + the three distinct errorCodes.
  - §10 step 8 (PR-5) gained the operator-facing migration warning
    requirement from the CTO review: "NUXT_MCP_AUTH_TOKEN is bypassed on
    /mcp when OAUTH_ENABLED=true" must land in DEPLOYMENT.md before any
    public OAuth announcement.

Tester (3 must-fix):
  - ALS scope cleanup test: assert getTenantContext() is undefined
    OUTSIDE the middleware wrap. Defends against a future als.enterWith
    regression that would leak tenant context across requests.
  - WWW-Authenticate full RFC 6750 §3 shape now pinned (Bearer
    error="invalid_token" + errorCode + error_description) on
    BEARER-REVOKED and BEARER-ORPHAN tests, not just the errorCode
    substring.
  - BEARER-REVOKED takes precedence over BEARER-ORPHAN test added —
    catches a future refactor that swaps the two if-blocks.
  - Concurrent N=2 cross-tenant ALS-isolation test added: two parallel
    middleware invocations with different Bearers each see THEIR OWN
    tenant inside next(). The middleware-layer counterpart of PR-2a's
    #64 ALS spike.

Programmer:
  - Replaced `denyBearer(...)` calls with `return denyBearer(...)` so
    TypeScript narrows `token` to string without a non-null assertion.
    The `token!` on the hash construction is gone.
  - Added a one-line comment on the String(inspection.userId)
    coercion documenting the PR-2a TenantContext.userId-is-string
    contract.

Security:
  - wwwAuthHeader now escapes backslash and quote in `code` /
    `description` defensively (current callers pass literals, but a
    future refactor threading a Bitrix24-controlled value into
    `description` shouldn't be able to inject extra header attributes).
  - server/middleware/mcp-auth.ts no longer yields unconditionally on
    flag=true — defence-in-depth: it requires AT LEAST the
    `Authorization: Bearer …` shape before yielding. Worst case if the
    toolkit middleware is missing (HMR / module-init bug), /mcp still
    returns 401, not an auth bypass.

Tests: 667 passed (+4 round-1 cases), typecheck + eslint clean.
Round 1 is the FINAL review round per CTO verdict.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PR-2c: requestId — populate unconditionally + getRequestId() throwing helper PR-2c: wire pruneExpiredStates scheduler (5-min interval)

2 participants