Skip to content

Harden gateway connection and pairing flows#741

Merged
ranjeshj merged 6 commits into
openclaw:mainfrom
ranjeshj:user/ranjeshj/connectionfixes
Jun 12, 2026
Merged

Harden gateway connection and pairing flows#741
ranjeshj merged 6 commits into
openclaw:mainfrom
ranjeshj:user/ranjeshj/connectionfixes

Conversation

@ranjeshj

Copy link
Copy Markdown
Collaborator

Summary

Harden Windows companion gateway connection, setup-code pairing, node pairing, token recovery, and browser-control auth flows.

Key changes:

  • keep operator and node WebSocket lifecycles separate, with setup-code operator connects using role=operator
  • persist all returned role handoff tokens from hello-ok.auth.deviceTokens[]
  • clear bootstrap credentials only after required role tokens are durably readable
  • forward node device-token receipt from WindowsNodeClient through NodeConnector so bootstrap cleanup can complete after node reconnect
  • request and handle pairing scopes correctly, including response-aware node/device approval and admin-gated fallback
  • prevent stale node clients from mutating current state and abort node handshakes when capability registration fails
  • preserve shared-token HTTP/dashboard/browser-control semantics, including browser proxy registration only when a shared gateway token is available
  • add local MCP-only connection controls without exposing them through the gateway node transport
  • add WSL setup retry for transient post-terminate /etc/wsl.conf read timing while keeping config validation strict
  • add protocol research documentation and expanded connection/E2E coverage

Validation

  • ./build.ps1 passed
  • dotnet test ./tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj --no-restore passed
  • dotnet test ./tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj --no-restore passed
  • dotnet test ./tests/OpenClaw.Connection.Tests/OpenClaw.Connection.Tests.csproj --no-restore passed
  • dotnet test ./tests/OpenClaw.SetupEngine.Tests/OpenClaw.SetupEngine.Tests.csproj --no-restore passed
  • OPENCLAW_RUN_E2E=1 dotnet test ./tests/OpenClaw.E2ETests/OpenClaw.E2ETests.csproj -r win-arm64 --no-restore passed: 17/17
  • Copilot autoreview clean: no accepted/actionable findings
  • Dual-model protocol audit completed; accepted multi-role handoff finding fixed and revalidated

Notes

Draft PR for review. The changes intentionally avoid broad WebSocket lifecycle rewrites or gateway-side assumptions; fixes are scoped to client-side behavior verified against docs/CONNECTION_PROTOCOL_RESEARCH.md and docs/CONNECTION_ARCHITECTURE.md.

Consolidate the connection/pairing hardening work into one validated change set.

Connection and credential handling:
- keep gateway credentials registry-backed and preserve strict credential precedence: device token, shared gateway token, then bootstrap token
- force fresh setup-code bootstrap credentials for immediate QR/setup-code pairing while preserving shared gateway tokens for HTTP/dashboard paths
- dedupe loopback-equivalent gateway URLs so localhost and 127.0.0.1 records do not split pairing state
- validate replacement shared tokens before disconnecting or clearing durable device tokens
- clear stale bootstrap tokens only after required role tokens are durably readable
- recover stale operator device-token mismatches by falling back to bootstrap when recovery material is still present

Operator/node pairing and token lifecycle:
- keep operator clients in the operator role during bootstrap while preserving explicit node bootstrap behavior
- persist role-specific handoff tokens from hello-ok auth.deviceTokens[] for both operator and node roles
- forward WindowsNodeClient node-token receipt through NodeConnector so GatewayConnectionManager can complete bootstrap cleanup after the node token becomes durable
- request operator.pairing with normal shared-token operator connects so node trust approvals can be reached
- wait for node/device pair approval responses instead of treating a sent frame as success
- fall back from node.pair.approve to device.pair.approve only when admin authority is available
- guard node connection events by client generation so stale clients cannot mutate current state
- abort node handshake when pre-connect capability binding fails, preventing caps=0/cmds=0 registrations

Tray, MCP, and browser-control behavior:
- expose connection-control MCP tools only through local MCP, not the gateway node transport
- route MCP setup-code and shared-token connection tools through GatewayConnectionManager
- refresh gateway node state when local node connected/paired events arrive
- register browser.proxy only when a live gateway client and shared gateway token are available, and use the shared token for browser-control HTTP auth

Setup and reliability:
- add bounded retry for transient WSL startup timing when validating /etc/wsl.conf after WSL terminate/apply-config
- keep invalid wsl.conf content validation strict after the read succeeds
- preserve SSH tunnel behavior for operator and node connection paths

Maintainability simplifications:
- reuse setup-code gateway lookup state in GatewayConnectionManager
- centralize delayed reconnect scheduling with generation/disposal guards
- centralize response-aware pair approval RPC handling
- consolidate operator scope helper literals and checks

Validation:
- build.ps1 passed
- OpenClaw.Shared.Tests passed
- OpenClaw.Tray.Tests passed
- OpenClaw.Connection.Tests passed
- OpenClaw.SetupEngine.Tests passed
- full OpenClaw.E2ETests passed with OPENCLAW_RUN_E2E=1 and win-arm64 runtime
- targeted QR/setup-code E2E tests passed after audit follow-up
- Copilot autoreview passed with no accepted/actionable findings
- dual-model protocol audit completed; accepted multi-role handoff finding fixed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ranjeshj ranjeshj force-pushed the user/ranjeshj/connectionfixes branch from 6b93d72 to 5d14408 Compare June 11, 2026 14:19
@clawsweeper

clawsweeper Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codex review: needs maintainer review before merge. Reviewed June 12, 2026, 11:33 AM ET / 15:33 UTC.

Summary
The PR hardens gateway operator/node lifecycles, pairing approvals, role-token persistence, browser-control authentication, local MCP controls, WSL validation, CI sharding, and connection E2E coverage.

Reproducibility: yes. for the central local setup and recovery failures: current source paths and automated Windows/WSL E2E provide high-confidence reproduction coverage. Remote gateway behavior remains unproven.

Review metrics: 2 noteworthy metrics.

  • Changed surface: 40 files, +4450/-378. Authentication, persisted credentials, setup, local controls, CI, and E2E infrastructure change together.
  • Automated E2E: 3 passing shards. Latest-head CI separately exercises setup/connect, revocation recovery, and network recovery.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Attach redacted live proof for fresh pairing and an existing paired-gateway upgrade and recovery flow.
  • Obtain explicit maintainer acceptance of the credential replacement behavior or narrow that part of the patch.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The inspectable evidence is automated Windows/WSL E2E output; the manual local-test claim has no attached redacted runtime output or recording, and remote paths are explicitly unverified. Add proof to the PR body, redact private endpoints, tokens, IP addresses, and other sensitive data, and request @clawsweeper re-review if updating the body does not refresh review automatically. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] Existing paired gateways intentionally enter new setup-code and explicit shared-token replacement transitions; maintainers must accept when durable device credentials may be replaced and re-pairing may occur.
  • [P1] Remote and externally hosted gateway pairing, token handoff, and recovery remain explicitly unvalidated, so local success does not settle those affected reports.
  • [P1] Automated E2E is strong supplemental validation, but no inspectable contributor-run artifact currently demonstrates the after-fix behavior on the claimed real setup.

Maintainer options:

  1. Prove existing-gateway upgrade (recommended)
    Attach redacted live output or a recording showing fresh setup-code pairing and a previously paired gateway surviving credential handoff, restart, and node reconnect without unintended credential loss.
  2. Narrow credential-policy changes
    Split forced-bootstrap and shared-token replacement policy from the independently useful lifecycle and stale-client fixes if their upgrade semantics cannot be approved now.
  3. Accept local-only validation
    Merge while explicitly owning that remote and external gateway pairing remains unverified and may require follow-up repair.

Next step before merge

  • [P2] The next action is contributor-owned live proof plus maintainer judgment on upgrade-sensitive credential replacement; no concrete mechanical code defect remains for an automated repair lane.

Security
Cleared: No concrete security or supply-chain regression was found; the credential-mutating controls remain on the existing loopback-only MCP surface and are not registered with the gateway node transport.

Review details

Best possible solution:

Retain the lifecycle separation, stale-client guards, response-aware approvals, validation-first replacement, and durable role-token handling, then merge only after inspectable live upgrade proof and explicit maintainer acceptance of the credential replacement semantics.

Do we have a high-confidence way to reproduce the issue?

Yes for the central local setup and recovery failures: current source paths and automated Windows/WSL E2E provide high-confidence reproduction coverage. Remote gateway behavior remains unproven.

Is this the best way to solve the issue?

Mostly yes: separating operator/node lifecycles, rejecting stale clients, distinguishing approval kinds, validating replacement credentials before destructive changes, and requiring durable role tokens are appropriate. The forced-bootstrap and credential-replacement policy still requires explicit upgrade acceptance or narrower scoping.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against 913ba4e8f504.

Label changes

Label changes:

  • add rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • remove rating: 🧂 unranked krab: Current PR rating is rating: 🦪 silver shellfish, so this older rating label is no longer current.

Label justifications:

  • P1: The PR addresses active, user-blocking pairing and reconnect failures while changing authentication behavior for existing installations.
  • merge-risk: 🚨 compatibility: Existing paired records can follow new setup-code and shared-token replacement transitions that may require re-pairing or credential migration.
  • merge-risk: 🚨 auth-provider: The patch changes selection, validation, persistence, handoff, and retirement of bootstrap, shared, operator-device, and node-device tokens.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The inspectable evidence is automated Windows/WSL E2E output; the manual local-test claim has no attached redacted runtime output or recording, and remote paths are explicitly unverified. Add proof to the PR body, redact private endpoints, tokens, IP addresses, and other sensitive data, and request @clawsweeper re-review if updating the body does not refresh review automatically. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

What I checked:

  • Latest-head validation: The latest head has successful repository-hygiene, test, three Windows/WSL E2E shard, Socket, and x64/arm64 build checks. (.github/workflows/ci.yml:282, 0e6aa83eb63d)
  • Credential replacement safety: When durable role tokens exist, the patch validates a proposed shared token while bypassing stored device tokens before disconnecting, updating the registry, or clearing existing role tokens. (src/OpenClaw.Connection/GatewayConnectionManager.cs:553, 0e6aa83eb63d)
  • Credential policy risk: Setup-code application deliberately forces the fresh bootstrap credential for the immediate connection, while explicit shared-token replacement clears durable role tokens only after the replacement token validates. (src/OpenClaw.Connection/GatewayConnectionManager.cs:504, 0e6aa83eb63d)
  • Current-main provenance: Git blame ties the existing shared-token replacement and role-token clearing flow to the v0.6.3 baseline; this PR attempts to make that existing destructive transition validation-first. (src/OpenClaw.Connection/GatewayConnectionManager.cs:520, 85445c78066b)
  • Local MCP boundary: Repository source and MCP documentation keep MCP-only capabilities on the loopback server and separate from gateway node registration; no remote exposure of the new connection controls was established. (docs/MCP_MODE.md:1, 913ba4e8f504)
  • Proof limitation: The contributor describes manual local setup and reconnect testing and links successful automated E2E output, but supplies no attached redacted terminal output, logs, recording, or other inspectable artifact; the comment also states remote paths still need validation. (0e6aa83eb63d)

Likely related people:

  • Scott Hanselman: Git shortlog shows the largest contribution volume across the central connection manager, gateway client, and node service paths. (role: feature owner; confidence: high; files: src/OpenClaw.Connection/GatewayConnectionManager.cs, src/OpenClaw.Shared/OpenClawGatewayClient.cs, src/OpenClaw.Tray.WinUI/Services/NodeService.cs)
  • ranjeshj: Recent history shows substantial gateway connection and pairing work, including earlier bootstrap-role behavior and this branch's hardening. (role: recent area contributor; confidence: high; commits: 0f83031fdff8, 5d14408f3947; files: src/OpenClaw.Connection/GatewayConnectionManager.cs, src/OpenClaw.Shared/OpenClawGatewayClient.cs, tests/OpenClaw.E2ETests/Setup/SetupAndConnectTests.cs)
  • Christine Yan: Blame and symbol history associate the current shared-token replacement flow with the v0.6.3 baseline commit. (role: introduced behavior; confidence: high; commits: 85445c78066b; files: src/OpenClaw.Connection/GatewayConnectionManager.cs)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 11, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. P1 Urgent regression or broken agent/channel workflow affecting real users now. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 auth-provider 🚨 Merging this PR could break OAuth, tokens, provider routing, model choice, or credentials. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels Jun 12, 2026
ranjeshj and others added 4 commits June 12, 2026 06:34
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The removed test used an admin fixture to approve node pairing, so it did not represent the QR-only external scenario it claimed to cover and could leave pending node approvals that broke later setup/connect tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ranjeshj

Copy link
Copy Markdown
Collaborator Author

@clawsweeper re-review

Proof update:

  • Manually tested the setup flow with the PR branch tray and local WSL gateway: generated a QR/setup code from the gateway, pasted it into the tray, observed the pairing/approval flow, and inspected tray + gateway state/logs during the run.
  • Manually tested local connection scenarios around setup-code pairing, shared-token reconnect, bad shared token handling, invalid setup code handling, and local gateway/tray reconnect behavior.
  • Added extensive end-to-end coverage for local setup/connect scenarios in CI. See the setup/connect E2E tests in this CI job: https://github.com/openclaw/openclaw-windows-node/actions/runs/27424642089/job/81059057254?pr=741

Remote/external connection paths still need follow-up coverage and validation; this PR is intended to harden and prove the local setup/connect paths first.

@clawsweeper

clawsweeper Bot commented Jun 12, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 12, 2026
@ranjeshj ranjeshj marked this pull request as ready for review June 12, 2026 19:36
@ranjeshj ranjeshj merged commit ea36b12 into openclaw:main Jun 12, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 auth-provider 🚨 Merging this PR could break OAuth, tokens, provider routing, model choice, or credentials. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. P1 Urgent regression or broken agent/channel workflow affecting real users now. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant