Skip to content

feat(llm,github): tenant-scoped runtime overrides + GitHub polish#19

Merged
hoangsnowy merged 10 commits into
mainfrom
feat/tenant-llm-and-github-polish
May 30, 2026
Merged

feat(llm,github): tenant-scoped runtime overrides + GitHub polish#19
hoangsnowy merged 10 commits into
mainfrom
feat/tenant-llm-and-github-polish

Conversation

@hoangsnowy
Copy link
Copy Markdown
Owner

Change description

Stacked on top of #18. Re-base targets main automatically once #18 merges.

Closes a multi-tenant correctness bug in the LLM gateway and the GitHub integration, then polishes the GitHub PR opener.

The bug

RuntimeOverrides used to be an in-memory singleton populated once at startup by HydrateRuntimeOverridesAsync — running inside whatever DI scope the AppHost happened to spin up first. Whatever tenant that scope resolved to became the source of truth for the entire process. Every subsequent request — regardless of which tenant signed in — read the SAME _overrides.AnthropicApiKey, _overrides.GitHubPat, etc.

In multi-tenant deployments this means tenant A's Anthropic key bills tenant B's pipeline runs, and tenant B's PR opener pushes to tenant A's GitHub repo. Not a hypothetical leak — it's the only behavior the old singleton could produce.

The fix

RuntimeOverrides becomes a thin read / write-through adapter over IAppConfigStore. Every property access creates a per-call DI scope, resolves the current ITenantContext, and reads / writes the matching app_config row. IAppConfigStore's existing 15s per-(tenant, key) cache absorbs the read pressure, so a key-pool factory closure that reads _overrides.AnthropicApiKey on every LLM call only hits the DB once per cache window per tenant.

The IRuntimeOverrides interface keeps its sync getter/setter shape — LlmModule's key-pool factories (registered as Singleton) and GitHubPrService (registered as Transient) don't need to become async. Sync-over-async is bridged inside the impl; ASP.NET Core has no SynchronizationContext, so the bridge is deadlock-safe.

GitHub integration polish

  • New GitHubBaseUrl override per tenant → Octokit hits a GHE host instead of public github.com when set. Blank / null → public host.
  • Pre-flight missing X, Y, Z message lists every absent config field instead of "PAT and target repo must be set".
  • NotFoundException on the base-branch fetch → wrapped as "branch X not found, set GitHubBaseBranch or create it".
  • AuthorizationException → wrapped as "PAT rejected, check 'repo' scope and expiry".
  • Branch prefix agentic-sdlc/<ts>agentos/<ts> so operators can git push origin --delete agentos/* to clean up.

Removals

  • HydrateRuntimeOverridesAsync (LlmModule no longer implements IInitializableModule)
  • SaveRuntimeOverridesAsync (Settings UI writes through the setters)
  • File RuntimeOverridesExtensions.cs deleted entirely

Type of change

  • Bugfix (multi-tenant isolation)
  • Refactor (no observable behavior change in single-tenant mode)

Test plan

  • dotnet build Release: 0 warnings, 0 errors
  • dotnet test: 226 passed, 5 skipped (live-smoke)
  • 8 new tests:
    • 5 tenant-isolation tests on RuntimeOverrides (Tenant A writes, Tenant B reads null; per-tenant GitHub PAT; key-pool round-trips; blank value deletes; no-store falls back to silent no-op)
    • 3 GitHub config branches (missing PAT → message names it, missing owner+name → both listed, blank title → ArgumentException)

Decisions

  • Sync interface preserved. Changing IRuntimeOverrides to async would force a cascade through PooledChatLlmClient.SendAsync (the key-pool delegate is sync) and GitHubPrService. Trade-off: a .GetAwaiter().GetResult() lives in the impl. ASP.NET Core has no SynchronizationContext, so this is deadlock-safe; the per-(tenant, key) 15s cache caps DB pressure.
  • No cache invalidation event. When the Settings UI writes a new key, the next 15s window of reads still serves the old value. Acceptable for an operator-config surface; documented in code.
  • GHE base URL stored per tenant. Different tenants may live on different GHE instances. The current single-tenant install just leaves it blank.

Checklist

  • dotnet build passes locally in Release mode
  • dotnet test passes
  • No secrets committed
  • README / docs updated — none required (private-by-default API surface, tests document the new behavior)

hoangsnowy and others added 9 commits May 30, 2026 11:20
Extract SignupValidation static helper consumed by both Blazor Signup
page and /tenants/register endpoint so form-side and server-side rules
cannot drift. Slug regex per plan: lowercase, internal hyphens only,
1-32 chars. Password: ≥12 chars + upper/lower/digit/symbol. Email via
MailAddress round-trip. 16 new test cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire MailHog into Aspire AppHost (SMTP 1025, UI 8025) and point the
realm smtpServer block at it. Set realm verifyEmail:true so unverified
users cannot log in. KeycloakAdminClient: when sendVerifyEmail flips
on, leave emailVerified=false and trigger Keycloak's execute-actions-
email; for self-signup (password supplied) the action list drops
UPDATE_PASSWORD and keeps only VERIFY_EMAIL. Signup page now passes
sendVerifyEmail:true and lands on /signup/verify-email with a hint
pointing dev users at MailHog.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ITenantSignupService dispatches invite / slug / auto-create by request
shape. Keycloak-first ordering so a registry-row failure can roll back
the Keycloak user (new DeleteUserAsync on the admin client).

Invitations are stateless DataProtection time-limited tokens — no DB
table — issued by /tenants/{id}/invitations (admin) and decoded by
/tenants/invitations/preview (public). Trade-off: tokens cannot be
revoked before TTL; keep TTL short for high-trust environments.

Signup page now accepts ?invite=<token>, pre-fills the email from the
invitation, and the Workspace ID field becomes optional (blank →
auto-generated slug). 7 new service tests + saga rollback verified
via NSubstitute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the four red flags called out in the Epic D plan:

1. agentic-web client: directAccessGrantsEnabled flipped to false so
   Keycloak no longer accepts ROPC on the code-flow client.
2. RequireHttpsMetadata defaults to true; only flipped off in
   appsettings.Development.json. Same change in JwtAuthExtensions.
3. Cookie SecurePolicy = Always in non-Development, SameAsRequest in
   Development (Aspire pins http://localhost:5180 there).
4. agentic-web-dev-secret and admin/admin no longer hardcoded in any
   .cs file. AppHost surfaces them as Aspire Parameters with dev
   defaults in appsettings.json; Web Program throws at startup if
   ClientSecret is missing outside Development. Standalone
   `dotnet run --project src/AgentOs.Web` now expects user-secrets to
   supply the dev secret (Aspire AppHost injects it automatically).

grep agentic-web-dev-secret\|admin/admin --include=*.cs → no matches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds GET /tenants/{id}/members guarded by Admin policy AND a runtime
ITenantContext match (so an Admin in tenant A cannot list tenant B).
KeycloakAdminClient gains ListUsersByTenantAsync — paged GET + client-
side filter on the tenant attribute, plus a follow-up role-mappings
fetch per user. Bounded by max=200 for v1; larger realms need a
server-side q-search rewrite.

Two new Razor pages under /admin: Members.razor lists the tenant
roster and mints DataProtection invitation URLs (admin pastes them
into chat/email); Settings.razor shows tenant id + name + created-at
read-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New IAuditLog (EF + Null impls) writes append-only rows to
tenants.audit_events for signup-completed, tenant-created, member-
invited, and invitation-minted actions. Writes are best-effort —
audit failure logs a warning but never breaks the surrounding flow.
Reads are tenant-scoped at the repository, then re-checked at the
endpoint against ITenantContext (Admin in tenant A cannot peek at
tenant B's trail).

GET /tenants/{id}/audit + /admin/audit Razor page show newest-first
rows. login.failed action is reserved for a follow-up — wiring it
through cookie / JWT events lands separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KeycloakAdminClient: 4 tests for DeleteUserAsync (no-content,
404 swallowed, 500 throws) and ListUsersByTenantAsync (filter +
role-mapping fetch).

HttpTenantContext: 5 tests for the claims projection — missing-claim
default, normal user, role flattening, IsAdmin off-by-role, and the
multi-value tenant claim edge case.

Epic D delta now sits at ~32 new tests (validation 16 + signup
service 7 + admin REST 4 + tenant context 5). Full unit suite stays
green (218 passed, 5 skipped live-smoke).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Microsoft.Extensions.* 10.0.0 → 10.0.8
- Microsoft.Extensions.Http.Resilience 10.0.0 → 10.6.0
- Microsoft.Extensions.Logging.Abstractions 10.0.6 → 10.0.8
- Microsoft.AspNetCore.* (OpenIdConnect / JwtBearer / OpenApi / SignalR
  Client / DataProtection.Abstractions) 10.0.0 → 10.0.8
- Microsoft.EntityFrameworkCore (+ .Design / .InMemory) 10.0.0 → 10.0.8
- Npgsql.EntityFrameworkCore.PostgreSQL 10.0.0 → 10.0.2
- Microsoft.Agents.AI* 1.7.0 → 1.8.0
- Aspire.Hosting* 13.3.3 → 13.3.5 (Keycloak preview stays on 13.3.3)
- Microsoft.Playwright 1.49.0 → 1.60.0

Three csproj files gain explicit Microsoft.EntityFrameworkCore +
.Relational PackageReference at 10.0.8 — without the pin, MAF's
transitive 10.0.4 wins the conflict resolution, ModuleLoader
ReflectionTypeLoadException blows up loading Relational 10.0.8 from
Npgsql 10.0.2.

Deferred (major bumps, separate PR):
- xunit.v3 1.1 → 3.2
- Microsoft.NET.Test.Sdk 17.13 → 18.6
- coverlet.collector 6.0.4 → 10.0.1
- JsonSchema.Net 7.x → 9.2.1

Build: 0 warnings, 0 errors. Tests: 218 passed, 5 skipped (live-smoke).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the in-memory singleton RuntimeOverrides with a thin read /
write-through adapter over IAppConfigStore. Every property access now
resolves the current ITenantContext (via a per-call DI scope) and
reads/writes the tenant's row in app_config — Tenant A's Anthropic
key, Azure endpoint, and GitHub PAT can no longer leak to Tenant B
on the same process.

The pre-existing in-memory model leaked: HydrateRuntimeOverridesAsync
ran once at startup with whatever tenant the bootstrap scope happened
to resolve, and from then on every tenant saw the SAME values. That
codepath is gone — Hydrate / Save extensions deleted, LlmModule no
longer implements IInitializableModule.

The IRuntimeOverrides interface keeps its sync getter/setter shape so
LlmModule's key-pool factories (singleton) and GitHubPrService
(transient) don't need to become async. Sync-over-async is bridged in
the impl; EfAppConfigStore's 15s per-(tenant,key) cache means the
bridge is hit at most once per cache window per property.

GitHub polish:
- New GitHubBaseUrl override → Octokit picks up the GHE endpoint when
  set, falls back to public github.com otherwise.
- Clear "missing X for this tenant" message lists every absent field
  instead of throwing a generic exception.
- NotFoundException / AuthorizationException from the base-branch GET
  are rewrapped with actionable hints (branch missing vs PAT bad).
- Branch prefix renamed agentic-sdlc → agentos.

8 new tests: 5 tenant-isolation + 3 GitHub config branches.
Full unit suite: 226 passed, 5 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hoangsnowy hoangsnowy changed the base branch from chore/package-bumps to main May 30, 2026 04:57
@hoangsnowy hoangsnowy merged commit 7ae2f4b into main May 30, 2026
1 check passed
@hoangsnowy hoangsnowy deleted the feat/tenant-llm-and-github-polish branch May 30, 2026 05:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant