feat(auth): Epic D — OIDC web signup + multi-tenant Keycloak#17

Merged
hoangsnowy merged 7 commits into main from feat/web-oidc-signup on May 30, 2026
Conversation

@hoangsnowy
Owner

Change description

Epic D: public self-service signup against multi-tenant Keycloak, with email verification, an audit trail, and the five security red flags listed in the checklist below closed off.

8 commits on the branch since main (oldest first):

  1. 5226f57 Web OIDC client + public sign-up + drop operator JWT (already on the branch at plan time)
  2. 6f68c08 Tighten signup validation rules (Task A)
  3. e4e8cbb MailHog + email verify on signup (Task B)
  4. a139921 3-mode signup + saga rollback (Task C)
  5. e41f6fa Kill ROPC, secrets out of code, tighten cookies (Task D)
  6. 04e0339 Tenant admin members + settings pages (Task E)
  7. 9b4b8df Tenant audit trail + /admin/audit page (Task F)
  8. 05f8da4 Cover Epic D code paths (Task G)

Type of change

  • New feature (new agent / endpoint / capability)
  • Infrastructure / CI configuration

Security checklist

Five red flags from the plan, all closed:

| # | Red flag | Closed by |
|---|----------|-----------|
| 1 | Signups skip email verification | e4e8cbb: realm verifyEmail:true + MailHog + Kc client triggers VERIFY_EMAIL action |
| 2 | directAccessGrantsEnabled: true on the code-flow client allows ROPC | e41f6fa: flipped to false on agentic-web |
| 3 | RequireHttpsMetadata defaults to false when the setting is missing | e41f6fa: default true, dev overrides via appsettings.Development.json |
| 4 | Cookie missing Secure policy | e41f6fa: Always outside Dev, SameAsRequest in Dev |
| 5 | agentic-web-dev-secret / admin/admin literals in .cs source | e41f6fa: Aspire AddParameter + AppHost appsettings.json; grep --include=*.cs returns nothing |

Decisions (autopilot defaults)

  • Invitations are stateless DataProtection tokens, not a DB table. Trade-off: cannot be revoked before TTL. Acceptable for v1; keep TTLs short. Future work: add an invitations table if revocation becomes a real need.
  • Saga ordering: Keycloak first, then registry. A registry write failure rolls back the Keycloak user; a Keycloak failure throws before any registry row is touched. Inverse of the pre-Epic-D order, where a Kc failure could orphan a tenant row.
  • registrationAllowed: true stays in the realm. The plan suggested flipping it once the custom signup proved out, but we never expose the Keycloak registration page — our /signup covers it. Leaving it true keeps Keycloak's own forgot-password / verify-email pages functional.
  • login.failed audit action is reserved, not wired. Hooking cookie / JWT failure events lands in a follow-up — the table column + action constant are in place so the writer is a one-liner later.
  • Members listing is client-side filtered, capped at 200 users. Sufficient for dev realms; larger realms need a server-side q=tenant:<id> rewrite.
  • /admin/* pages use page-level [Authorize(Roles="admin")], and the API endpoints add a runtime ITenantContext.TenantId == route tenantId check so an Admin in tenant A cannot list tenant B.
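
The last decision, the runtime tenant match layered on top of the role check, can be sketched in a few lines. This is a language-neutral illustration in Python, not the C# in the PR; `TenantContext` and `can_access_tenant` are hypothetical names standing in for ITenantContext and the endpoint guard.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class TenantContext:
    """Hypothetical stand-in for ITenantContext: who is calling, and from which tenant."""
    tenant_id: str
    roles: FrozenSet[str]


def can_access_tenant(ctx: TenantContext, route_tenant_id: str) -> bool:
    """Allow only callers that hold the admin role AND belong to the tenant in the route."""
    is_admin = "admin" in ctx.roles
    same_tenant = ctx.tenant_id == route_tenant_id
    return is_admin and same_tenant
```

The role check alone would let an admin of tenant A call /tenants/B/members; the second comparison is what closes that gap.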

Checklist

  • dotnet build passes locally in Release mode
  • dotnet test passes — 218 unit tests green, 5 live-smoke skipped (need API keys)
  • No secrets committed (dev placeholders are in gitignored appsettings.Development.json or Aspire Parameters with dev defaults in the AppHost's own appsettings.json)
  • README / docs updated — Task I (docs) was the optional stretch and isn't in this PR

Test plan

  • F5 the AppHost: Postgres + Keycloak + MailHog + API + Web come up; http://localhost:5180 lands on the dashboard
  • /signup with no slug → auto-create mode picks a slug like alice-3ab2f1, user lands on /signup/verify-email
  • /signup with slug acme → slug mode creates the workspace, user becomes admin
  • MailHog (http://localhost:8025) catches the verify email; clicking the link activates the user and login succeeds
  • Sign in as the admin → /admin/members lists the user, /admin/settings shows the workspace, /admin/audit shows signup.completed + tenant.created rows
  • /admin/members → mint invite URL → open in incognito → join existing tenant; audit shows invitation.minted

hoangsnowy and others added 7 commits May 30, 2026 11:20
Extract SignupValidation static helper consumed by both Blazor Signup
page and /tenants/register endpoint so form-side and server-side rules
cannot drift. Slug regex per plan: lowercase, internal hyphens only,
1-32 chars. Password: ≥12 chars + upper/lower/digit/symbol. Email via
MailAddress round-trip. 16 new test cases.
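
For reviewers who want the rules in executable form, here is a rough Python sketch of the three checks described above. It is not the SignupValidation helper itself. The exact regex, the assumption that digits count as valid slug characters (the test plan shows a slug like alice-3ab2f1), and the definition of "symbol" as any non-alphanumeric character are guesses; the email check imitates a parse-and-compare round-trip using the standard library.

```python
import re
from email.utils import parseaddr

# Assumed shape: lowercase letters/digits, hyphens only between segments.
_SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_slug(slug: str) -> bool:
    """1-32 chars, lowercase, no leading/trailing/double hyphens."""
    return 1 <= len(slug) <= 32 and _SLUG_RE.fullmatch(slug) is not None


def is_valid_password(password: str) -> bool:
    """At least 12 chars with an uppercase, a lowercase, a digit, and a symbol."""
    return (
        len(password) >= 12
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
        and any(not c.isalnum() for c in password)  # "symbol" = non-alphanumeric (assumption)
    )


def is_valid_email(email: str) -> bool:
    """Round-trip: the parsed address must equal the raw input and contain an '@'."""
    _, addr = parseaddr(email)
    return addr == email and "@" in addr
```

Sharing one set of functions between the form and the endpoint is what keeps the two rule sets from drifting.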

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire MailHog into Aspire AppHost (SMTP 1025, UI 8025) and point the
realm smtpServer block at it. Set realm verifyEmail:true so unverified
users cannot log in. KeycloakAdminClient: when sendVerifyEmail flips
on, leave emailVerified=false and trigger Keycloak's execute-actions-
email; for self-signup (password supplied) the action list drops
UPDATE_PASSWORD and keeps only VERIFY_EMAIL. Signup page now passes
sendVerifyEmail:true and lands on /signup/verify-email with a hint
pointing dev users at MailHog.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ITenantSignupService dispatches invite / slug / auto-create by request
shape. Keycloak-first ordering so a registry-row failure can roll back
the Keycloak user (new DeleteUserAsync on the admin client).
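
The ordering and its compensation step can be sketched as follows. This is a Python illustration of the pattern only; `identity` and `registry` are hypothetical stand-ins for the Keycloak admin client and the tenant registry, and the fakes exist purely so the rollback path can be exercised.

```python
class FakeIdentity:
    """Hypothetical identity provider that records created users in memory."""

    def __init__(self):
        self.users = {}

    def create_user(self, email, tenant_slug):
        user_id = f"user-{len(self.users) + 1}"
        self.users[user_id] = (email, tenant_slug)
        return user_id

    def delete_user(self, user_id):
        self.users.pop(user_id, None)


class FailingRegistry:
    """Hypothetical registry whose write always fails, to trigger the rollback."""

    def create_tenant(self, tenant_slug, owner_id):
        raise RuntimeError("registry write failed")


def sign_up(identity, registry, email, tenant_slug):
    # Step 1: identity provider first. If this raises, no registry row exists yet.
    user_id = identity.create_user(email, tenant_slug)
    try:
        # Step 2: registry write.
        registry.create_tenant(tenant_slug, owner_id=user_id)
    except Exception:
        # Compensate: delete the user created in step 1, then surface the failure.
        identity.delete_user(user_id)
        raise
    return user_id
```

With `FailingRegistry`, `sign_up` raises and leaves `FakeIdentity.users` empty, which is the property the saga tests are checking.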

Invitations are stateless DataProtection time-limited tokens — no DB
table — issued by /tenants/{id}/invitations (admin) and decoded by
/tenants/invitations/preview (public). Trade-off: tokens cannot be
revoked before TTL; keep TTL short for high-trust environments.
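
To make the trade-off concrete, here is a minimal Python sketch of a stateless, time-limited token. It uses an HMAC-signed JSON payload as a stand-in; ASP.NET Core DataProtection also encrypts the payload, which this sketch does not, and the function and field names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def mint_invitation(key: bytes, tenant_id: str, email: str,
                    ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return 'payload.signature'; nothing is stored server-side."""
    issued = time.time() if now is None else now
    payload = json.dumps(
        {"tenant": tenant_id, "email": email, "exp": issued + ttl_seconds}
    ).encode()
    sig = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())


def read_invitation(key: bytes, token: str,
                    now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong shape or bad base64
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims if current < claims["exp"] else None
```

Because validity is decided entirely by the signature and the embedded expiry, there is no row to delete: the only way to invalidate a token early is to rotate the key, which is why a short TTL matters.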

Signup page now accepts ?invite=<token>, pre-fills the email from the
invitation, and the Workspace ID field becomes optional (blank →
auto-generated slug). 7 new service tests + saga rollback verified
via NSubstitute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the four red flags called out in the Epic D plan:

1. agentic-web client: directAccessGrantsEnabled flipped to false so
   Keycloak no longer accepts ROPC on the code-flow client.
2. RequireHttpsMetadata defaults to true; only flipped off in
   appsettings.Development.json. Same change in JwtAuthExtensions.
3. Cookie SecurePolicy = Always in non-Development, SameAsRequest in
   Development (Aspire pins http://localhost:5180 there).
4. agentic-web-dev-secret and admin/admin no longer hardcoded in any
   .cs file. AppHost surfaces them as Aspire Parameters with dev
   defaults in appsettings.json; Web Program throws at startup if
   ClientSecret is missing outside Development. Standalone
   `dotnet run --project src/AgentOs.Web` now expects user-secrets to
   supply the dev secret (Aspire AppHost injects it automatically).

grep agentic-web-dev-secret\|admin/admin --include=*.cs → no matches.
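
The three environment-dependent defaults above reduce to a small decision table. The Python sketch below only illustrates that logic; the setting keys and function names are hypothetical, and the real wiring lives in the Web Program and JwtAuthExtensions.

```python
def require_https_metadata(settings: dict) -> bool:
    """Secure by default: a missing setting means True."""
    return bool(settings.get("RequireHttpsMetadata", True))


def cookie_secure_policy(environment: str) -> str:
    """Always outside Development; SameAsRequest only for local http."""
    return "SameAsRequest" if environment == "Development" else "Always"


def resolve_client_secret(settings: dict, environment: str):
    """Fail fast at startup when the secret is missing outside Development."""
    secret = settings.get("ClientSecret")
    if not secret and environment != "Development":
        raise RuntimeError("ClientSecret must be configured outside Development")
    return secret
```

The common thread is that the insecure value has to be opted into explicitly, so a missing config entry can no longer silently weaken production.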

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds GET /tenants/{id}/members guarded by Admin policy AND a runtime
ITenantContext match (so an Admin in tenant A cannot list tenant B).
KeycloakAdminClient gains ListUsersByTenantAsync — paged GET + client-
side filter on the tenant attribute, plus a follow-up role-mappings
fetch per user. Bounded by max=200 for v1; larger realms need a
server-side q-search rewrite.
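
A rough Python sketch of the paged fetch plus client-side filter, to show where the 200-user bound bites. `fetch_page` is a hypothetical callable standing in for the Keycloak users GET, the attribute name `tenant` is an assumption, and the per-user role-mappings fetch is left out.

```python
MAX_USERS = 200  # v1 bound from the commit message


def list_users_by_tenant(fetch_page, tenant_id, page_size=50):
    """Scan at most MAX_USERS realm users and keep those tagged with tenant_id.

    fetch_page(first, max_results) must return a list of user dicts whose
    'attributes' map attribute names to lists of strings.
    """
    members = []
    first = 0
    while first < MAX_USERS:
        page = fetch_page(first, min(page_size, MAX_USERS - first))
        if not page:
            break
        for user in page:
            if tenant_id in user.get("attributes", {}).get("tenant", []):
                members.append(user)
        first += len(page)
    return members
```

Note the failure mode: users beyond the first 200 in the realm are never scanned, so a tenant's members can be silently missing in a large realm. That is the reason a server-side search is flagged as the follow-up.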

Two new Razor pages under /admin: Members.razor lists the tenant
roster and mints DataProtection invitation URLs (admin pastes them
into chat/email); Settings.razor shows tenant id + name + created-at
read-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New IAuditLog (EF + Null impls) writes append-only rows to
tenants.audit_events for signup-completed, tenant-created, member-
invited, and invitation-minted actions. Writes are best-effort —
audit failure logs a warning but never breaks the surrounding flow.
Reads are tenant-scoped at the repository, then re-checked at the
endpoint against ITenantContext (Admin in tenant A cannot peek at
tenant B's trail).
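
The best-effort write semantics amount to a guarded append. Below is a minimal Python sketch of that behaviour; `sink` is any object with an `append` method, and both the function name and `BrokenSink` are hypothetical.

```python
import logging

logger = logging.getLogger("audit")


class BrokenSink:
    """Hypothetical sink that always fails, to exercise the warning path."""

    def append(self, event):
        raise IOError("audit store unavailable")


def append_audit_best_effort(sink, event: dict) -> bool:
    """Try to write one audit row; log a warning and carry on if it fails."""
    try:
        sink.append(event)
        return True
    except Exception:
        logger.warning("audit write failed for action %s",
                       event.get("action"), exc_info=True)
        return False
```

The caller can ignore the return value: a signup still completes when the audit store is down, at the cost of a gap in the trail that only the warning log records.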

GET /tenants/{id}/audit + /admin/audit Razor page show newest-first
rows. login.failed action is reserved for a follow-up — wiring it
through cookie / JWT events lands separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KeycloakAdminClient: 4 tests for DeleteUserAsync (no-content,
404 swallowed, 500 throws) and ListUsersByTenantAsync (filter +
role-mapping fetch).

HttpTenantContext: 5 tests for the claims projection — missing-claim
default, normal user, role flattening, IsAdmin off-by-role, and the
multi-value tenant claim edge case.

Epic D delta now sits at ~32 new tests (validation 16 + signup
service 7 + admin REST 4 + tenant context 5). Full unit suite stays
green (218 passed, 5 skipped live-smoke).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hoangsnowy hoangsnowy marked this pull request as ready for review May 30, 2026 04:42
@hoangsnowy hoangsnowy merged commit 4fd171f into main May 30, 2026
1 check passed
@hoangsnowy hoangsnowy deleted the feat/web-oidc-signup branch May 30, 2026 05:08
hoangsnowy added a commit that referenced this pull request May 30, 2026
The doc was a one-shot plan for the overnight Epic D autopilot run.
It's not project documentation, it doesn't reflect current direction
after Epic D landed (PR #17), and it shouldn't have made it into the
E3 commit — slipped past the staging filter that excluded it on E1
and E2. Removing per the product-focus rule: drop thesis / phase /
roadmap docs, keep only AI-essential docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hoangsnowy added a commit that referenced this pull request May 30, 2026
…idence (#21)

* feat(tools): add ITool/IToolRegistry contracts + InMemoryToolRegistry (E1)

Epic E1 lays the contract surface every later step composes on. Domain
gets six new types: ITool (callable capability), IToolRegistry
(register/resolve/list/unregister), ToolDefinition (name, description,
JSON input schema), ToolInvocationRequest/Result (carry CallId so the
orchestrator can match tool_use -> tool_result blocks), and
ToolException (distinct from LlmException so the orchestrator can tell
"model misused a tool" apart from "model call itself failed").

New AgentOs.Modules.Tools module ships the default
InMemoryToolRegistry (ConcurrentDictionary-backed so MCP probes and
the orchestrator can read/mutate concurrently) and a ToolsModule that
discovers ITool DI registrations at startup and pumps them into the
registry. Slnx + Tests csproj reference the new project.

22 new tests cover registry register/resolve/list/unregister, duplicate
detection, validation, and result factories. Full suite 248 pass,
5 skipped (live-LLM smoke).
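
For orientation, the registry surface described above (register / resolve / list / unregister with duplicate detection) looks roughly like this in Python. It is an illustrative sketch, not the C# InMemoryToolRegistry: a lock stands in for ConcurrentDictionary, and the method and exception names are assumptions.

```python
import threading
from typing import Callable, Dict, List, Optional


class DuplicateToolError(Exception):
    """Raised when a tool name is registered twice."""


class InMemoryToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}
        self._lock = threading.Lock()  # guards concurrent readers and writers

    def register(self, name: str, tool: Callable[[str], str]) -> None:
        if not name or not name.strip():
            raise ValueError("tool name must be non-empty")
        with self._lock:
            if name in self._tools:
                raise DuplicateToolError(name)
            self._tools[name] = tool

    def resolve(self, name: str) -> Optional[Callable[[str], str]]:
        with self._lock:
            return self._tools.get(name)

    def list_names(self) -> List[str]:
        with self._lock:
            return sorted(self._tools)

    def unregister(self, name: str) -> bool:
        with self._lock:
            return self._tools.pop(name, None) is not None
```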

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(llm,tools): wire ITool into the LLM gateway + BuildVerifierTool (E2)

LlmRequest grows an optional `Tools` field — a flat list of tool names
the agent is allowed to invoke for this call. PooledChatLlmClient (the
prod path for Claude + AzureOpenAI key pools) now resolves each name
through IToolRegistry, wraps the chat client with
FunctionInvokingChatClient, and threads the resolved tools into
ChatOptions.Tools. The whole tool-call loop runs inside the gateway so
ILlmClient.SendAsync still returns one LlmResponse (the final text
turn) — agents don't need to learn a new contract.

AIToolFunction (Modules.Llm, internal) adapts a Domain.Tools.ITool
into a Microsoft.Extensions.AI.AIFunction: the schema string round-trips
through JsonElement so it's exposed verbatim to the model, and the
InvokeCoreAsync override serializes the model-emitted arguments back
into ITool's stringly-typed Input, runs ITool.InvokeAsync, and returns
the textual Output (or the error message) for the next LLM turn.

BuildVerifier gains a primitive VerifyFilesAsync(IEnumerable<...>,
CancellationToken) so callers without a full PipelineResult — like the
new build_verifier tool — can pass a flat file list. The legacy
VerifyAsync(PipelineResult, ...) delegates to it. BuildVerifierTool
(Modules.Integration.Tools) wraps that primitive with a tight JSON
contract — input `{files:[{path,content}]}`, output `{success,
exit_code, output, elapsed_ms}` — and IntegrationModule registers it
via AddTool<>() so ToolsModule.InitializeAsync auto-discovers it at
startup.
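
The JSON contract for the build_verifier tool can be illustrated with a small validator. This Python sketch mirrors only the input and output shapes quoted above; rejecting an empty `files` array and the exact error messages are assumptions, not behaviour confirmed by the commit.

```python
import json
from typing import List, Tuple


def parse_files_input(raw: str) -> List[Tuple[str, str]]:
    """Validate {"files":[{"path":...,"content":...}]} and return (path, content) pairs."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"input is not valid JSON: {exc.msg}") from exc
    files = doc.get("files") if isinstance(doc, dict) else None
    if not isinstance(files, list) or not files:
        raise ValueError("'files' must be a non-empty array")  # assumption
    pairs = []
    for i, item in enumerate(files):
        if not isinstance(item, dict):
            raise ValueError(f"files[{i}] must be an object")
        path, content = item.get("path"), item.get("content")
        if not isinstance(path, str) or not path.strip():
            raise ValueError(f"files[{i}].path must be a non-empty string")
        if not isinstance(content, str):
            raise ValueError(f"files[{i}].content must be a string")
        pairs.append((path, content))
    return pairs


def build_result(success: bool, exit_code: int, output: str, elapsed_ms: int) -> str:
    """Serialize the output shape the tool hands back to the model."""
    return json.dumps({"success": success, "exit_code": exit_code,
                       "output": output, "elapsed_ms": elapsed_ms})
```

Returning validation errors as text, rather than throwing past the gateway, lets the model see what was wrong with its arguments and retry.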

14 new tests: AIToolFunction surface + invocation + error path,
PooledChatLlmClient registers/resolves/drops tools across {no tools,
tool found, tool missing, no registry}, BuildVerifierTool JSON
validation + delegation + failure surfacing. Full suite 262 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(mcp): add MCP client module that registers remote tools (E3)

New AgentOs.Modules.Mcp consumes external MCP servers (GitHub MCP,
filesystem MCP, custom servers) and surfaces their tools into the
existing IToolRegistry under the prefixed name "{server}.{tool}". Once
registered, an LLM agent can invoke a remote MCP tool through the same
LlmRequest.Tools path E2 wired for local ITools — no agent-side change.

McpOptions binds the per-server config (stdio command+args+env or
HTTP/SSE URL, enabled flag, call timeout). McpClientHost holds the
live McpClient connections, ListToolsAsync's each one at startup, and
wraps every returned McpClientTool into McpToolAdapter (ITool) with
the remote schema and description carried over verbatim. A failed
server connection is logged and skipped — the rest of the host still
boots. DisposeAsync unregisters every name and closes every client.

McpToolAdapter takes a McpToolInvoker delegate instead of an
McpClientTool directly so tests can stub MCP without a live server.
McpClientHost owns the delegate wiring and applies the per-tool-call
timeout (default 60s, configurable via Mcp:CallTimeoutSeconds).

10 new tests: McpToolAdapter input parsing + delegation + error
surfacing + cancellation, McpOptions config binding (defaults, two
servers, args/env round-trip, enabled flag). McpClientHost itself is
not yet exercised by tests — a live-MCP smoke test will land alongside
the sample MCP-GitHub integration in E6. Full suite 272 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop Epic-D plan doc from the repo

The doc was a one-shot plan for the overnight Epic D autopilot run.
It's not project documentation, it doesn't reflect current direction
after Epic D landed (PR #17), and it shouldn't have made it into the
E3 commit — slipped past the staging filter that excluded it on E1
and E2. Removing per the product-focus rule: drop thesis / phase /
roadmap docs, keep only AI-essential docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api,mcp): expose AgentOs pipeline as MCP server (E4)

The API host now serves MCP at /mcp via the official
ModelContextProtocol.AspNetCore Streamable HTTP transport, with three
tools surfaced to remote MCP clients (Claude Desktop, VS Code, custom
orchestrators):

- run_pipeline(description, max_iterations?, locale?) -> PipelineResult
  full 5-agent run on a single user story, collected to the final result
- list_runs(limit?) -> PipelineRunSummary[] (paged, capped 1..100)
- get_run(runId) -> PipelineRunRecord (artifacts + per-call metrics)

PipelineMcpTools is a [McpServerToolType] static class; the IPipelineClient
/ IPipelineRunRepository dependencies are injected per call by the MCP
SDK's tool factory, so existing ITenantContext / auth still applies
(JWT bearer middleware runs before /mcp routes).

The Api host now also loads ToolsModule + McpModule alongside the
others — Integration's BuildVerifierTool needs IToolRegistry to be
registered, and McpModule's startup hook wires upstream MCP servers
into the same registry. Together with E1-E3 this closes the
tools-mesh loop: AgentOs is both an MCP client (consumes external
tools) and an MCP server (exposes its own pipeline tools).

No new tests this step — a meaningful MCP smoke test wants a live
TestServer + an in-process McpClient round-trip and belongs with the
sample app in E6. Build clean, full suite 272 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(tools): add IToolPolicy gate + IToolInvocationLog evidence sink (E5)

Every tool call the LLM gateway routes through AIToolFunction now
passes through two new seams:

- IToolPolicy.EvaluateAsync(request) — pre-invocation gate. Denied
  calls short-circuit; the policy's reason string is fed back to the
  LLM as the tool_result so the model can react (retry with different
  arguments, give up, ask the user). Default impl PermissiveToolPolicy
  allows everything — production wires a tenant-aware impl that reads
  the allowlist + cost cap from AppConfig.

- IToolInvocationLog.AppendAsync(evidence) — best-effort evidence sink.
  One entry per invocation (allowed-and-succeeded, allowed-and-errored,
  or denied) capturing the call id, tenant, run id, tool name, input
  JSON, output, error flag, and start/finish timestamps. The audit
  trail covers refusals as well as successful runs. Log failures are
  swallowed — a downstream sink outage must never break a tool call.
  Default impl InMemoryToolInvocationLog is a per-tenant ring buffer
  bounded at 500 entries so a runaway loop can't OOM the host.
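
  The two seams compose as sketched below. This is a Python illustration rather than the C# in the commit: `policy` is a hypothetical callable returning (allowed, reason), the request is a plain dict, and a `deque` with `maxlen` plays the role of the per-tenant ring buffer.

```python
from collections import defaultdict, deque


class InMemoryInvocationLog:
    """Per-tenant ring buffer: the oldest entries fall off once the cap is hit."""

    def __init__(self, cap: int = 500):
        self._by_tenant = defaultdict(lambda: deque(maxlen=cap))

    def append(self, tenant_id, entry):
        self._by_tenant[tenant_id].append(entry)

    def recent(self, tenant_id, limit: int = 50):
        entries = self._by_tenant.get(tenant_id, ())
        return list(entries)[::-1][:limit]  # newest first


def _record(log, request, result):
    if log is None:
        return
    try:
        log.append(request["tenant"], {**request, **result})
    except Exception:
        pass  # a sink outage must not break the tool call


def invoke_tool(tool, request, policy=None, log=None):
    """Gate the call, run it, and record evidence for every outcome."""
    if policy is not None:
        allowed, reason = policy(request)
        if not allowed:
            # Denied: the reason becomes the tool result the model sees.
            result = {"output": reason, "is_error": True, "denied": True}
            _record(log, request, result)
            return result
    try:
        result = {"output": tool(request["input"]), "is_error": False, "denied": False}
    except Exception as exc:
        result = {"output": str(exc), "is_error": True, "denied": False}
    _record(log, request, result)
    return result
```

  Passing `policy=None` and `log=None` reproduces the "null behaviour" hosts get when neither service is registered.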

PooledChatLlmClient resolves both interfaces optionally from DI when
constructing AIToolFunctions, so existing tests + hosts that don't
register the new services keep working with null behaviour.

EF-backed persistence (tool_invocations table in the tools schema) +
the policy that loads tenant allowlists from AppConfig are deferred
to E5.next — this commit ships the abstractions and the in-memory
defaults so the rest of the platform can already start handing evidence
data structures around.

9 new tests: InMemoryToolInvocationLog (recency order, per-tenant
isolation, cap enforcement, limit), AIToolFunction integration (policy
deny short-circuit, allow path, no-policy no-log fallback, log-failure
absorption). Full suite 281 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(readme): document the Tools + MCP modules and the /mcp endpoint (E6)

Adds two new module entries (AgentOs.Modules.Tools,
AgentOs.Modules.Mcp), updates the Integration line so the
BuildVerifierTool registration is visible, extends the cross-module
dep note with Integration -> Tools, and adds a Tools & MCP subsection
explaining the LlmRequest.Tools loop, the policy + evidence seams, and
the fact that AgentOs is now both an MCP client and an MCP server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>