App/cross year matching 1#62
Conversation
per-partner toggle for cross-year student ID matching. defaults to false so existing partners are unchanged. temporary feature until an ID resolution service is in place.
specifies eduCredsExist + getEduConnectionInfo. covers local env-var fallback path for development. AWS Secrets Manager path is exercised in deployed environments and not unit-tested here. failing — methods not yet implemented.
getEduConnectionInfo reads from EDU_SNOWFLAKE_* env vars when present (local dev), otherwise looks up the AWS secret edu-connection-info-<partnerId>. returns null when nothing is available so callers can degrade gracefully — cross-year matching is best-effort, not required for a run to proceed. eduCredsExist is a boolean wrapper for use in payload assembly without leaking the creds themselves.
specifies the contract between app and executor: - crossYearMatchAvailable is always present - when toggle on AND creds exist: flag=true, appUrls.roster set - when either is missing: flag=false, no roster URL failing — payload assembly doesn't read the toggle yet.
EarthbeamApiJobResponseDto gains crossYearMatchAvailable (always present, boolean) and an optional appUrls.roster. the flag is true iff the partner has cross-year matching enabled AND EDU creds are available; otherwise jobs proceed without cross-year matching. the roster URL is omitted entirely when the flag is false so the executor can't accidentally call a disabled endpoint.
specifies the streaming roster endpoint: - 401 for unauthenticated requests (guard) - 409 when toggle off or creds missing - 200 + application/x-ndjson when both prerequisites are met - abrupt close on mid-stream errors (no in-band sentinel) the service-level streamCrossYearRoster stub is in place so the controller can be wired and the test suite compiles; implementation lands in the green step.
GET /earthbeam/jobs/:runId/roster streams the partner's cross-year roster to the executor as NDJSON. Reuses the existing executor JWT guard; returns 409 if the toggle is off or EDU creds are missing. The Snowflake client is lazy-imported on first use — snowflake-sdk has slow module-init side effects we shouldn't pay on every app boot, and cross-year fetches are a rare hot path. Connection is per-request and closed in finally. Mid-stream errors abort the response (no in-band sentinel); the executor detects truncation and fails the run. The query is a TODO placeholder until the data engineer supplies the real one — controller wiring, auth, and error semantics are all in place so the swap is one-line.
Move the Snowflake staging schema into the EDU connection payload, use the real cross-year roster query, and make partner-specific AWS secrets take precedence over local env fallbacks. This keeps Snowflake config in one place and restores the intended secret-first behavior for deployed environments. Co-authored-by: Codex <codex@openai.com>
Use NODE_ENV to gate the EDU Snowflake env-var fallback so deployed environments rely on partner-specific AWS secrets. If the secret is missing or incomplete outside local development, return null instead of silently falling back. Co-authored-by: Codex <codex@openai.com>
Add the local-development EDU Snowflake fallback variables to the API env template so cross-year roster testing can be configured without guessing the required keys. Co-authored-by: Codex <codex@openai.com>
Clarify in the API env template that the local EDU Snowflake key values should contain base64-encoded PEM contents, with quick examples for file and clipboard encoding. Co-authored-by: Codex <codex@openai.com>
The previous `hostname.split('.')[0]` only kept the leading subdomain,
dropping the region segment for URLs like `<account>.<region>.snowflakecomputing.com`
and causing the SDK's JWT auth to fail. Strip the `.snowflakecomputing.com`
suffix instead so the full account locator is preserved.
Co-authored-by: Cursor <cursoragent@cursor.com>
Previously we accepted a Snowflake URL and parsed an `account` out of it, and used the schema by interpolating it into the SQL query. Switching to discrete `account`, `database`, and `schema` values that get passed to `snowflake.createConnection` removes the URL parsing, drops the SQL-identifier validator (no longer interpolated), and lets the SDK namespace the query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opening a Snowflake connection takes ~20s — paying that on every roster request was the dominant cost. Introduce EduSnowflakePoolService that keeps one snowflake-sdk pool per partner (min 1, max 4). On module init we fire-and-forget warm a pool for every partner with cross-year matching enabled, so first-request latency disappears for warm pools. Requests for partners without a warmed pool create one on demand and share the same in-flight promise across concurrent callers. Failed pool creation is evicted so the next request retries. streamCrossYearRoster now goes through eduPool.use(partnerId, cb) instead of opening and tearing down its own connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The env-var fallback in AppConfigService.getEduConnectionInfo is gated on isDevEnvironment() (NODE_ENV === 'development'). Jest defaults NODE_ENV to 'test', so the tests stopped exercising the env-var path once the gate was added — eduCredsExist returned false and the payload omitted the roster URL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
the previous write loop discarded response.write's return value, which meant a slow executor would cause node to buffer the entire roster in memory. switch to stream.pipeline so node manages backpressure end to end; pipeline also destroys downstream streams on error, preserving the abrupt-close behavior on mid-stream snowflake failures.
getEduConnectionInfo previously collapsed every secrets-manager error (IAM, throttling, network, malformed JSON, missing) to null, which collapsed to a 409 from the roster endpoint and crossYearMatchAvailable false in the job payload. operationally indistinguishable from "feature off." now it returns null only for ResourceNotFoundException and logs + rethrows everything else. eduCredsExist (used at payload assembly time) still swallows so unrelated run creation degrades gracefully. the roster endpoint's prereq check switches to calling getEduConnectionInfo directly so a real aws outage produces a 5xx, not 409.
the two streaming tests used to mock streamCrossYearRoster itself, so they never exercised the pipeline, transform, sql, or binds. they now spy on EduSnowflakePoolService.use and supply a fake connection whose execute() returns a Readable, letting the real service body run. happy path asserts on the captured execute args (sqlText targets the EDU staging tables, binds is [tenantCode], streamResult is true) plus the NDJSON body. mid-stream-error asserts the response truncated after exactly one row with no in-band sentinel, replacing a test that asserted nothing.
the catch handler unconditionally deleted the partner key, which could clobber a newly-inserted (in-flight) pool entry under the same key. only delete if the map still points at the failed entry.
snowflake key-pair auth uses only the private key; the public half is uploaded to snowflake out-of-band. carrying publicKey through the config layer added nothing and the "missing publicKey → null" gate would have silently disabled the feature for an otherwise-valid secret. removed from the EduConnectionInfo type, both secret and env-var fallbacks, the validation check, .env.copyme, and the test fixtures.
the executor's conditional roster-fetch step was missing from the Executor Lifecycle section. also add the DTO file to the App↔Executor key files. project policy in AGENTS.md itself requires doc updates alongside behavior changes.
generic-pool defaults acquireTimeoutMillis to forever, so under saturation (5+ concurrent runs for the same partner) requests could hang indefinitely. 60s is generous over the ~20s JWT connect observed locally and gives a clear timeout error instead of an unbounded wait.
without onModuleDestroy, open snowflake sockets outlived the process on SIGTERM (beanstalk/ECS deploys), risking hung shutdowns past the platform grace period. drains each pool with a 10s race timeout under Promise.allSettled so one stuck pool can't block the others.
the abrupt-close decision was for mid-stream failures (after rows have been written). when pool acquisition or connection.execute fails before the first row, no bytes are flushed yet — tearing the TCP connection just makes it harder for the executor to diagnose. the controller now checks res.headersSent and throws InternalServerErrorException when nothing has been sent, falling back to res.destroy only after headers are out. test covers the pre-stream case.
generic-pool's pool.use() releases the resource on both resolve and reject, so a connection whose row stream errored mid-flight gets handed to the next acquirer as healthy. switch to explicit acquire + release on success / destroy on error so the suspect connection leaves the pool.
This reverts commit f1c566a.
the private async getAWSSecret signature lost its 2-space class-member indent — re-indent to match neighbors.
the existing entry only mentioned cross-year as a side note on the ODS roster fetch step, which implied a single transform pass that includes the cross-year roster up front. the actual flow is a second earthmover pass triggered only when the first pass leaves unmatched students and the partner has the feature enabled.
eduCredsExist was a thin wrapper that called getEduConnectionInfo, did !== null, and swallowed errors as "false". the swallow-on-error decision is specific to job-payload assembly (don't break run creation if Secrets Manager hiccups) — moving it to that one caller keeps app-config focused on fetching creds rather than also owning what to do when fetching fails. drops the wrapper and its two thin tests; the real fetch is still covered by the getEduConnectionInfo tests and by the cross-year payload tests in earthbeam-api.spec.ts.
inline the env-var fallback path (matching how postgresPoolConfig and encryptionKey already handle their AWS/env split) and drop the EduConnectionInfo named type — callers infer the shape. brings the EDU method into line with the rest of AppConfigService.
2b9c110 to
b02fa5d
Compare
motivation: - the central secrets cache in AppConfigService meant cred rotation required a process restart on top of the pool refresh PR 2 already plans. - "do creds exist?" was answered by AppConfigService, which had no natural way to bypass its own cache for the EDU path. - payload assembly and the roster endpoint were both asking AppConfig about EDU, even though the pool service is the thing that actually uses the creds. changes: - AppConfigService.getAWSSecret now wraps a new uncached fetchAWSSecret; postgres/encryption/jwt callers keep cached semantics, EDU calls fetchAWSSecret directly. one cache for one concern. - EduSnowflakePoolService gains a public canConnect(partnerId) that answers from a resolvedPools set when a pool is already up (no AWS call needed) or falls back to a fresh getEduConnectionInfo lookup. failures are swallowed + logged so payload assembly can degrade. - earthbeamInputForRun and getCrossYearRosterContext both ask the pool service instead of AppConfig. the 5xx-vs-409 distinction at the roster endpoint becomes 409 for any "can't connect" reason; real failures during streaming still propagate through the controller's headersSent-aware catch (500 pre-stream, abrupt close mid-stream).
getCrossYearRosterContext was calling eduPool.canConnect to decide between 409 and proceeding to stream. with the secrets cache gone for EDU, that meant a cold roster request did two uncached AWS lookups — one for the prereq check, one for createPool inside the stream. drop the check. the toggle check still answers "feature off for this partner" (409). everything else flows through the stream attempt; pool creation failures hit the controller's headersSent-aware catch and turn into a clean 500. the existing "missing EDU creds" test moves from 409 to 500 (and still exercises the real pool-creation failure path, unlike the mocked-pool 500 test).
| if (cachedValue) { | ||
| return cachedValue; | ||
| } | ||
| const secretValue = await this.fetchAWSSecret(secretName); |
There was a problem hiding this comment.
I don't like how AppConfigService caches AWS secrets. For the EDU creds secret, I want to make it easy to grab fresh values in the secret without restarting the app and caching would prevent that.
For now, I just decided to factor fetching the secret out of caching since there are other callers of getAWSSecret and we haven't had any actual issues with caching. Long term, I think it makes sense to remove caching entirely from AppConfigService. Callers should instead manage their own caches if they want to have a cache at all.
So multiple environments (dev/stg/prod) sharing one AWS account can keep distinct EDU connection secrets per partner. Throws if ENVLABEL is unset, matching the existing ECS-config helper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Method had one caller and most of its LOC were SUCCESS/ERROR object construction. The controller now does the run lookup + toggle check directly. partnerId/tenantCode are read from run.job scalars for symmetry between the two args. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single caller, just a query + response. Controller now injects EduSnowflakePoolService directly and owns the full Snowflake-to-NDJSON pipeline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Snowflake server-side session timeouts would silently kill the min:1 warm connection after hours of idle, causing the next request to fail on a stale handle. Switched to min:0 with eviction every 60s on a 60s idle threshold, per Snowflake's docs. With min:0 the startup warm-up creates pool objects but no actual sockets, so it became dead weight — removed warmPools, OnModuleInit, and the now-unused PRISMA_READ_ONLY dep. First request per partner unconditionally pays the ~20s JWT connect cost; fine for a temporary feature on low-traffic partners. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop resolvedPools — pools.has gives the same fast-path for canConnect with a benign in-flight-failure race. - Early-return in getOrCreatePool; remove failedEntry rename now that the catch handler closes over a single entry. - Drop the single-field PoolEntry wrapper; map and methods carry Pool<Connection> directly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
||
| // Lazy import: snowflake-sdk has slow module-init side effects we don't | ||
| // want to pay on every app boot. | ||
| const snowflake = await import('snowflake-sdk'); |
There was a problem hiding this comment.
Note -- I'm working through whether this lazy import is actually necessary. I'm suspicious
The cross-year payload and roster integration tests drove getEduConnectionInfo via process.env (NODE_ENV=development + EDU_SNOWFLAKE_*), with a duplicated save/restore harness in each describe block. Switched to a jest spy on AppConfigService.getEduConnectionInfo — one beforeEach per block, default mockResolvedValue(null), per-test override where creds need to be present. Also dropped edu-config.spec.ts entirely: both tests only exercised the local-dev env-var fallback (never the AWS Secrets Manager path that runs in deployed environments), and the same dev-fallback shape is now implicitly covered through the AppConfigService spy seam. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Move the describe to sit after the top-level its and before the Descriptor Mappings describe, so the send-to-ODS / no-ODS pair is no longer split. - Invert the default test state: beforeEach now sets cross-year fully enabled (toggle on + creds mocked present). The happy path needs no overrides; each negative test removes exactly one condition, so a broken gate can't silently make the test pass for the wrong reason. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both the happy-path and mid-stream-error roster tests now use a streaming parser via .buffer(true).parse(streamParser). The parser collects chunks as they arrive and signals whether 'end' fired (complete) or 'close'/'error' fired first (truncated). The mid-stream test was previously written as "either we got a partial body or supertest threw, both are fine" because the abort behavior varied across Node/supertest combos. The streaming parser makes it deterministic — the test asserts complete=false and the exact captured bytes. Also dropped the test-local poolUseSpy declaration and afterEach cleanup; mocks created inside an `it` are now mockRestored at the bottom of the same `it`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Extract a mockEduPoolStream helper that stubs the pool's use() with a fake connection streaming from an Iterable/AsyncIterable. Each test's distinguishing setup is now a single line (the source stream). - Drop the binds assertion + rename the happy-path test — the bind value is literally visible in the controller, no derived behavior to verify. - Group the two stream-consuming tests into a `streaming responses` describe so the helpers (streamParser, mockEduPoolStream) live next to the tests that use them. The toggle-on prereq moves into the nested beforeEach. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Mid-stream test: move the comment to point at the complete:false assertion as the truncation indicator (set by streamParser when 'close'/'error' fires before 'end'). Drop the body shape sanity check — the exact-bytes assertion already covers it. - Pool-acquisition-fails test: drop the content-type check. If we received a 500 status cleanly, supertest got a proper response; a torn socket would surface as a client-side error instead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
snowflake-sdk can return BigInt for high-precision NUMBER columns, and JSON.stringify throws on BigInt. A sync throw inside _transform escapes pipeline and crashes the process; routing the failure through cb(err) lets pipeline destroy the response cleanly, which the executor surfaces as the same abrupt-close it already handles. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bhaugeea
left a comment
There was a problem hiding this comment.
Looks good: streaming is cool and the connection management seems fully baked. Some high-level comments before I look at the tests, try to run the code, and look at all the details more closely (and also before I log off until Wednesday).
| 9. **Output files**: POST output file path + `sentToOds` flag to `/output-files` callback; app validates path, lists S3, saves `run_output_file_set` | ||
| 10. **Done**: POST status `{action: DONE, status: success|failure}` | ||
| 6. **Transform**: `earthmover run` against the ODS roster (with encoding detection + retry) | ||
| 7. **Cross-year retry** (when `crossYearMatchAvailable` and the first pass produced unmatched students): GET `appUrls.roster` for the cross-year NDJSON roster, write to a `.jsonl` file, and re-run `earthmover` against it using the same ID type. Recovers students whose identifiers changed between school years. |
There was a problem hiding this comment.
Is there any case in which a row is "supposed to" get sent to the ODS, but doesn't match there and is instead sent straight to EDU? Or is there reliable alignment between this ODS/EDU fallback flowchart and the current/past year intended loading destination flowchart?
There was a problem hiding this comment.
I should have looked at this note carefully :/ The piece about identifiers changing between years is not a thing, so striking that.
But to your questions... the main thing about how the Executor coordinates ODS roster matches and EDU roster matches is that it progresses through each stage of processing for both before moving onto the next. That is, it runs Earthmover, which both performs the ID matching and transforms input files, with the ODS roster and then, if the ODS run succeeded, the EDU roster before sending results anywhere. Then, if that's successful, the Executor sends ODS-matched results to the ODS. If that succeeds, the Executor returns output files sets (for ODS and EDU matched results) to the API.
This should prevent most issues of the kind (I think) you're asking about. For example, it rules out having a file generate an "insufficient matches" error against the ODS roster (say 49% matched) but then succeeding against the EDU roster, when 49% of those matches (and probably more) really should be going to the ODS.
One issue that we don't prevent is if the file contains (a) an incorrect ID that (b) matches the ID for a different student. This possibility already exists with matching solely against the ODS, but including IDs from EDU in theory increases the probability of (b). That's a limitation of the current ID matching strategy more than how rosters are sourced. Moving to ID resolution that takes other observables (name, grade, DOB, etc) into account will be necessary to address it.
| // Evict failed creations so subsequent requests can retry. Guard against | ||
| // racing with a concurrent insertion under the same key — only delete if | ||
| // the map still points at this entry. |
There was a problem hiding this comment.
Is it possible for there to be a concurrent insertion? All the code from L84 through L88 is synchronous isn't it? L92 is not, but if many callers called before the first had rejected, only the first would make it past L85.
There was a problem hiding this comment.
Nope! Good catch. This is left over from a previous, more involved implementation, I'll remove it.
| * Maintains one Snowflake connection pool per partner, created on demand. | ||
| * The first request for a partner pays the ~20s JWT connect cost; subsequent | ||
| * requests reuse the pool until the evictor reaps idle connections. |
There was a problem hiding this comment.
Re: your comment about looking for my perspective on this general strategy: maybe I'm not thinking hard enough about it, but I don't have any huge ideas about it and this seems totally fine. The other options you explored were keeping connections warm even when they're not being used, and not keeping a long-lived pool at all?
Seems like compared to creating connections fresh every time they're needed, the extra code here is basically onModuleDestroy() and getOrCreatePool()`, and they don't seem too bad — and you're clear that usage is indeed somewhat bursty, so there's a decent chance this code will help users avoid the 20s startup delay sometimes. I wouldn't mind a little reasoning about the 60s lifetime though. I mean that's pretty quick. If it takes someone 3 minutes to go get the other assessment file they're trying to load, it's another cold start. Is the server-side timeout that quick, giving us no other option (or, am I misunderstanding something)?
The main two things that come to mind in general are the cold-start delay, which you're clear about here, and how this all does or doesn't relate to Snowflake's serverless cost structure, which I wouldn't mind some kind of mention of.
There was a problem hiding this comment.
I did a writeup here to think through the questions you're bringing up. The 60 sec idle timeout is definitely not correct. I changed to 30 min, which I think should handle the burst-iness while allowing the pool to shrink back down in a reasonable amount of time.
The TLDR on the writeup is that I think 30 mind idle timeout, a 60 sec connection acquisition timeout, and a large setting for max connections is the way to go. That allows the app to reuse connections, but also allows the Snowflake concurrent query limits to be the throttle -- trying to stay under that via connection pool limits seems unnecessary (Snowflake manages a query queue, doesn't seem to care about concurrent connections) -- we really want the connection pool to enable connection reuse while letting connections get evicted, not so much for capping the number of connections like we would with postgres.
But really, I think we just need to get something reasonable in dev and try it out under some load. My hunch is that initiating ECS tasks and the first phases of job processing introduce enough lag of varying amounts that we wouldn't actually have a ton of API jobs hitting /roster all at once, even if they're initiated all at once.
- bump EDU snowflake pool idleTimeoutMillis to 30 minutes so bursty partner usage within a half hour reuses warm connections; still well under Snowflake's 4h default programmatic-session timeout - simplify the failed-creation eviction in getOrCreatePool: there's no code path that races a concurrent insertion under the same key, so a plain delete suffices - add a Cross-Year Matching Flow subsection to AGENTS.md with a mermaid diagram describing the executor's staged ordering (both transforms, then ODS load, then API exposure) and clarifying that cross-year-matched rows reach external consumers via the Runway API, not the ODS - drop the "identifiers changing between school years" framing from the cross-year retry step; that's not the actual use case Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bhaugeea
left a comment
There was a problem hiding this comment.
Looks great, runs well, and tests well!
- bump the per-partner EDU snowflake pool max from 4 to 20 to give bursty multi-tenant API workloads more concurrent connections before callers queue on acquire - add optional warehouse and role to EDU connection info (both the local-dev env path and the Secrets Manager path); when unset they're omitted from the connection options so Snowflake applies the user's defaults, letting the app override when needed - document EDU_SNOWFLAKE_WAREHOUSE / EDU_SNOWFLAKE_ROLE in .env.copyme Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR is smaller than the LOC update makes it seem! It's mostly package-lock.json updates from adding snowflake-sdk:

Overview
This PR contains the backend changes to enable cross year ID matching. What does that mean? Runway will now do two rounds of ID matching (for ODS-bound files, if cross-year matching is enabled):
Unmatched IDs after these two passes will be handled as they are today (the app makes a file of unmatched IDs available for users to resolve and re-upload).
App Updates
The app's role in cross-year ID matching is:
crossyearMatchEnabled) and the ability to connect to EDU (as determined by either the presence of an active connection for that partner or EDU credentials that could be used to establish a connection).The app already supports receiving multiple output file sets per run. There are some cosmetic updates (e.g. the job view will say "all results sent to ODS" even for split jobs) but we'll take care of those in a follow on PR.
EDU Connections
How the app manages EDU connections is more elaborate than I started out with and I'm open to dialing it back. Initially, I just had the app establishing a Snowflake connection per query and then discarding it. But I noticed that the Snowflake connection can take 20+ seconds, so wanted to try pooling them (by partner) and keeping an active connection on hand. However, that ended up being more elaborate than I liked so I scaled it back to allow 0 connections in the pool... but there's still the pool for now. Runs do come in bursts (e.g. a user has a set of files to load) so there's maybe some value in connection reuse.
For current purposes, where it's just one query per job, I think using one connection per query could be fine -- the Executor doesn't mind waiting. But I do expect that managing EDU connections is something the app will need to do even after we switch to some more robust ID resolution service -- the first version of that service will likely live in Runway and need EDU connections so I think it could be worth just getting some pooling in. Still, I'm happy to pull it out if it feels like too much right now.
Follow-on work
There's some work I'm reserving for follow on PRs: frontend changes and sideloading with EDU.
Frontend updates
Not in this PR, but there will be a pane in the Admin Settings page showing whether EDU creds exist for the partner and, if so, allowing cross-year ID matching to be enabled. If we keep the approach to pooled connections, this might also allow resetting the connection (e.g. you update the creds in the AWS secrets) and want to trigger a refresh.
Additionally, we'll need to clean up some language on the Job View page that implies that if any results were sent to an ODS, all results were sent to an ODS.
What about no-ODS jobs?
With this PR, no-ODS (sideloaded) jobs will continue to use a tenant+year-specific roster file from S3. In a follow-on PR, we'll all no-ODS jobs to use EDU as a roster source if available (while preserving the roster file from S3 option if EDU is not available). This will requires some backend and frontend updates to how we check whether a no-ODS year is available. Given the mix of BE and FE changes, it felt cleanest to just pull this piece out.