Cloud gateway v next #87
Open
shleder wants to merge 12 commits into
Snapshot of the newer cloud API gateway / MCP Trust-Gates firewall for external architecture review. Published to a dedicated branch so GitHub main (older) is not overwritten.

Scope:
- Cloud gateway src/ (auth, tenant isolation, SSRF filter + IP pinning, schema/AST/honeytoken/scope/preflight gates, per-tenant token bucket, optional AI jailbreak guard, dynamic policy, BYOT tool registry).
- PostgreSQL + pgvector data layer (reader/writer split, migrations, semantic + L2 cache, billing idempotency).
- Stripe billing (checkout, portal, webhook with HMAC + replay window + idempotency), Resend email, SIEM streamer, Prometheus metrics.
- Fly.io + Docker deployment, monitoring stack, compatibility layer (OpenAI/Anthropic), workspaces (langchain, vercel-ai, dashboard, portal).
- AI knowledge base under docs/ai-context/ and Kiro steering under .kiro/steering/.

Excluded (secrets/artifacts): all .env* except .env.example, logs, test-results, node_modules, local DB/cache, loose native binaries.

NOT production-ready: npm run assert:package-metadata fails (package.json files[] omits dist/utils/child-env.*) and DB-dependent test suites were not run locally (no DATABASE_URL). See docs/ai-context/CHANGELOG_FOR_AI.md.
…, prod blockers, docs)
- package metadata: remove dist/utils/child-env.{js,d.ts} from package.json
files[] and lock them in scripts/assert-package-metadata.mjs forbiddenFiles.
child-env is only used by the unpublished gateway-config / stdio paths and
is not part of the published lib.js surface. `npm run assert:package-metadata`
and `npm run verify:all` are now GREEN.
- prod boot guard: add validateProductionDatabaseUrl() in src/index.ts. When
NODE_ENV=production and neither DATABASE_URL nor MASTER_DATABASE_URL is set,
the process refuses to start. Defense-in-depth: /health returns 503 (not
healthy) in production when the DB is unconfigured (no serving from
in-memory stores).
- prod blocker (documented, not fixed): Postgres TLS rejectUnauthorized:false
in src/database/postgres-pool.ts annotated with TODO(vNext, prod-blocker);
not claimed secure.
- prod blocker (documented, not fixed): trust proxy 'loopback' in src/index.ts
annotated with TODO(vNext, prod-blocker) for Fly/edge deployments.
- docs: PROJECT_SNAPSHOT.md + CHANGELOG_FOR_AI.md now reflect branch
cloud-gateway-vNext, clean tree, successful push; the stale ABORTED release
event is superseded by a successful publish event; critical-file hashes
recomputed. SECURITY_AUDIT.md F-01/F-02 marked with vNext status.
Verification: assert-package-metadata PASS, typecheck PASS, build PASS,
test 467 passed / 3 skipped, verify:all GREEN. NOT production-ready: see
production_blockers in CHANGELOG_FOR_AI.md. main untouched.
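
The prod boot guard described in this commit could look roughly like the sketch below. It is not the code in src/index.ts: the `Env` type, the `isDatabaseConfigured` and `healthStatusCode` helpers, and the error text are assumptions made for illustration. Only the function name `validateProductionDatabaseUrl` and the two environment variables come from the commit message.

```typescript
// Sketch only: the env is passed in so the guard stays a pure, testable function.
type Env = Record<string, string | undefined>;

// Hypothetical helper: true when either connection string is set and non-blank.
function isDatabaseConfigured(env: Env): boolean {
  const primary = env["DATABASE_URL"]?.trim();
  const master = env["MASTER_DATABASE_URL"]?.trim();
  return Boolean(primary || master);
}

// Refuse to boot in production without a database (fail closed).
function validateProductionDatabaseUrl(env: Env): void {
  if (env["NODE_ENV"] === "production" && !isDatabaseConfigured(env)) {
    throw new Error(
      "Refusing to start: set DATABASE_URL or MASTER_DATABASE_URL in production",
    );
  }
}

// Defense-in-depth for /health: 503 in production when the DB is unconfigured.
// Hypothetical helper; a real handler would also check live connectivity.
function healthStatusCode(env: Env): number {
  return env["NODE_ENV"] === "production" && !isDatabaseConfigured(env) ? 503 : 200;
}
```

In a Node entry point the guard would presumably be called as `validateProductionDatabaseUrl(process.env)` before the server starts listening, so a misconfigured deploy fails at boot rather than serving from in-memory stores.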
…ked CI
Fixes the two highest-risk production blockers on cloud-gateway-vNext.
F-01 Postgres TLS certificate verification:
- Add resolvePostgresTls() (testable, fail-closed). Non-local DBs now use
rejectUnauthorized:true with optional CA from PG_CA_CERT (inline PEM) or
PGSSLROOTCERT (file path), else the system CA store. Production rejects
sslmode=disable and PG_TLS_INSECURE at config time. Local dev/test
(localhost) stays no-TLS. Never logs URL/password/CA contents.
- Tests: tests/postgres-tls.test.ts.
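
A minimal sketch of the F-01 decision logic, assuming a signature that the commit message does not specify. The `PgTls` type, the injected `readFile` parameter, the list of local hostnames, and the order of the checks are all assumptions. The environment variable names and the fail-closed behavior are taken from the description above.

```typescript
type Env = Record<string, string | undefined>;

// `false` means no TLS; otherwise the certificate is always verified.
type PgTls = false | { rejectUnauthorized: true; ca?: string };

// Assumption: which hosts count as "local" is not stated in the commit.
const LOCAL_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

function resolvePostgresTls(
  databaseUrl: string,
  env: Env,
  readFile: (path: string) => string, // injected so the sketch needs no fs import
): PgTls {
  const url = new URL(databaseUrl);

  if (env["NODE_ENV"] === "production") {
    // Fail closed at config time. Messages never include the URL or CA contents.
    if (url.searchParams.get("sslmode") === "disable") {
      throw new Error("sslmode=disable is not allowed in production");
    }
    if (env["PG_TLS_INSECURE"]) {
      // Any non-empty value is rejected, including "0" or "false".
      throw new Error("PG_TLS_INSECURE is not allowed in production");
    }
  }

  // Local dev/test databases stay without TLS.
  if (LOCAL_HOSTS.has(url.hostname)) {
    return false;
  }

  // Inline PEM wins over a file path; with neither, the system CA store is used.
  const inlinePem = env["PG_CA_CERT"];
  const caPath = env["PGSSLROOTCERT"];
  const ca = inlinePem || (caPath ? readFile(caPath) : undefined);
  return ca ? { rejectUnauthorized: true, ca } : { rejectUnauthorized: true };
}
```

Real code would likely pass `(p) => fs.readFileSync(p, "utf8")` as `readFile`; injecting it keeps the function pure, which is one way to get the "testable" property the commit mentions.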
F-02 reverse-proxy / client-IP trust:
- Add src/config/proxy-trust.ts: resolveTrustProxySetting() drives
app.set('trust proxy', ...) and FAILS LOUD in production when
MCP_TRUST_PROXY is unset/"true"/garbage. fly.toml sets MCP_TRUST_PROXY=1.
- Color-boundary state now keyed by buildColorBoundaryKey (tenant-namespaced,
never raw IP alone) in both the middleware and the dispatcher, so two
tenants behind one proxy IP cannot share boundary state.
- HTTP_REQUEST audit now records clientIp + proxyIp.
- Tests: tests/proxy-trust.test.ts (incl. Express XFF integration).
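
The two F-02 pieces can be sketched as below. This is an illustration under assumptions, not the contents of src/config/proxy-trust.ts: accepting only a numeric hop count, falling back to 'loopback' outside production, and the key format are all hypothetical. The function names, MCP_TRUST_PROXY, and the requirement that the boundary key is tenant-namespaced come from the commit message.

```typescript
type Env = Record<string, string | undefined>;

// Returns a value suitable for Express's app.set("trust proxy", ...).
function resolveTrustProxySetting(env: Env): number | string {
  const raw = env["MCP_TRUST_PROXY"]?.trim();

  // Assumption: only an explicit hop count (e.g. "1" on Fly) is accepted.
  if (raw !== undefined && /^\d+$/.test(raw)) {
    return Number(raw);
  }

  // Unset, "true", or any other value fails loud in production.
  if (env["NODE_ENV"] === "production") {
    throw new Error(
      'MCP_TRUST_PROXY must be an explicit hop count in production (for example "1")',
    );
  }

  // Assumption: outside production, keep the earlier 'loopback' behavior.
  return "loopback";
}

// Tenant-namespaced boundary key: the IP is never used on its own.
// The "tenant:...|ip:..." format is hypothetical.
function buildColorBoundaryKey(tenantId: string, clientIp: string): string {
  if (!tenantId) {
    throw new Error("tenantId is required for color-boundary state");
  }
  return `tenant:${tenantId}|ip:${clientIp}`;
}
```

With this shape, two tenants that reach the gateway through the same proxy IP get different keys, so they cannot read or mutate each other's boundary state.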
DB-backed CI:
- Add .github/workflows/ci-db.yml: runs the full suite against
pgvector/pgvector:pg16 with DATABASE_URL set, creates the vector
extension, and fails if DB-dependent suites self-skip.
Docs: SECURITY_AUDIT (F-01/F-02 FIXED), RUNTIME_AND_DEPLOYMENT (TLS +
proxy + boot guards), TESTING_GAPS (validation tiers), PROJECT_SNAPSHOT
(blockers), CHANGELOG_FOR_AI (hashes + release event #3). .env.example
documents the new knobs.
No feature work. SSRF/tenant-isolation/auth/rate-limit/schema/cache-poison
protections unchanged. Local verify:all GREEN (24 suites, 499 passed,
3 skipped). DB-backed path runs in CI. main untouched.
Replace the placeholder/stale SHA in CHANGELOG_FOR_AI.md with the actual HEAD of the TLS + reverse-proxy hardening commit. No code change.
Ledger (docs/ai-context/CHANGELOG_FOR_AI.md):
- git_commit_head and release_event_3.new_head now equal the actual branch HEAD (e281ff1), replacing the stale f9426c5.
- Record the explicit commit list: f9426c5 (TLS/proxy hardening) + e281ff1 (ledger SHA precision).
- Record the DB-backed CI observation on e281ff1 and the guard fix.
CI (.github/workflows/ci-db.yml):
- The DB suites ran AND passed against pgvector on run 26638123393, but the final visibility-guard step false-failed: it grepped `jest --verbose` run output for "tests/<suite>.test.ts", a string jest does not emit verbatim.
- Replace it with a deterministic `jest --listTests` enumeration (respects testPathIgnorePatterns, prints absolute paths) run BEFORE the suite, proving the DB suites are in the run set when DATABASE_URL is set. The run step is now a plain `npm test`.
No application source changed. main untouched.
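
The visibility check on `jest --listTests` output could be expressed as a small pure function like the one below. The actual guard lives in the workflow file and may be a shell step; the function name `findMissingSuites` and the suite names in the usage note are hypothetical.

```typescript
// Given the stdout of `jest --listTests` (one absolute path per line) and the
// repo-relative suites that must be present, return the ones that are missing.
function findMissingSuites(listTestsOutput: string, requiredSuites: string[]): string[] {
  const listed = listTestsOutput
    .split(/\r?\n/)
    .map((line) => line.trim().replace(/\\/g, "/")) // normalize Windows separators
    .filter((line) => line.length > 0);

  // Match on a path-segment boundary so "tests/a.test.ts" does not match
  // "other-tests/a.test.ts".
  return requiredSuites.filter(
    (suite) => !listed.some((absolutePath) => absolutePath.endsWith("/" + suite)),
  );
}
```

A guard step would fail the job when the returned array is non-empty, for example when called with required suites such as `tests/semantic-caching.test.ts` (a guessed file name). Because the list is computed before the run, it does not depend on how jest formats its run output.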
snapshot.git_commit_head was stale (8b00727 while branch tip was dce23a3). Add release_event_5 recording the actual DB-backed CI conclusion on the current HEAD (run 26638935114 = FAILURE: guard fix worked, ~15 DB suites genuinely fail) and correct event_4's unverified 'green' prediction. Explicitly record f9426c5 (TLS/proxy hardening) and e281ff1 (ledger SHA precision) in the commit ledger. Docs-only; no application logic changed; main untouched.
…failure)
Upgrade event_5 ci_trigger_decision from prediction to verified fact: the docs-only push (a293778+a422e71) triggered run 26651124794 on a422e71, which concluded FAILURE (no-DB gate green, DB-integration red). This is the same root cause, as expected for a docs-only change. The edit sits below the self-hash marker, so the pre-marker self-hash is unchanged. Docs-only; main untouched.
Trailing 1-line head-sync commit. Sets snapshot.git_commit_head=24046c2 (the verified-CI-rerun ledger commit), appends a293778+24046c2 to git_commit_base lineage, and updates self_hash_prefix to C0A47AEFC142D57F. Docs-only; main untouched.
Migrate ~15 DB-backed test suites from the removed SQLite path (Phase 39 SQLite->Postgres) so they run correctly under DB-backed CI. The suites self-skipped locally (no DATABASE_URL) which hid the breakage; CI runs them against pgvector and they failed at load/assert time.

Changes (tests only; no application source touched):
- Replace deleted imports ../src/database/sqlite-pool.js and ../src/cache/semantic-store-sqlite.js with the Postgres API (postgres-pool / semantic-store-postgres).
- Await now-async calls (key-registry, tiers, rate-limiter, pending-checkouts, metrics aggregator, semantic store) consumed synchronously.
- Use describeWithDb + setupDbHarness for DB-touching suites (self-skip without DATABASE_URL, run migrations + truncate in CI) instead of initializePersistentStores/MCP_DB_MEMORY.
- semantic-caching: migrate to pgvector API (findSemanticHit miss => undefined; isSemanticCacheEnabled() no-arg; drop removed save-cap options).
- production-seeding: markerDir API; drop SQLite-only resolveDbFile/.sqlite assertions (no Postgres equivalent).
- production-email: replace SQLite pool-flush shutdown test with a Postgres-path graceful-shutdown assertion.
- tier-rate-limiting/token-bucket: await async store calls; seed permissive policy so dispatcher integration tests assert tier/bucket behaviour without fail-closing at the policy gate.

No security invariant weakened: tenant isolation, auth 401s, billing replay/idempotency, signature refusal, revocation, and semantic-cache tenant isolation assertions are all preserved. DB-suite visibility guard in ci-db.yml unchanged and still valid.
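
The self-skip behavior of describeWithDb can be sketched as follows. The real helper's implementation is not shown in this PR text, so the factory `makeDescribeWithDb`, the `DescribeLike` interface, and the injected env are assumptions chosen to keep the example runnable without jest.

```typescript
type Env = Record<string, string | undefined>;
type SuiteFn = (name: string, body: () => void) => void;

// Minimal structural stand-in for jest's `describe` (callable, with `.skip`).
interface DescribeLike {
  (name: string, body: () => void): void;
  skip: SuiteFn;
}

// Run the suite only when a database is configured; otherwise register it as skipped.
function makeDescribeWithDb(describeImpl: DescribeLike, env: Env): SuiteFn {
  return env["DATABASE_URL"] ? describeImpl : describeImpl.skip;
}
```

In a jest test file this would presumably be wired as `const describeWithDb = makeDescribeWithDb(describe, process.env)`. The drawback noted in the commit still applies: a locally skipped suite can hide breakage, which is why CI separately verifies that the DB suites are in the run set.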
describe the user-visible or maintainer-visible change
list the main files or subsystems touched
explain the problem being solved
explain how the change fits the fail-closed stdio-first product shape
`npm run verify:all`
`npm run demo:stdio` if runtime or trust-gate behavior changed
`npm run benchmark:stdio -- --json --output evidence.json` if security claims, benchmark corpus, or cache behavior changed
`npm run pack:dry-run && npm run pack:smoke` if packaging, CLI surface, docs install commands, or release workflows changed
docs updated if claims, demos, release notes, or repo metadata changed
note any residual risks, unsupported claims, metrics changes, or follow-up work