Skip to content

App/cross year matching 1#62

Merged
edandylytics merged 46 commits into
developmentfrom
app/cross-year-matching-1
May 28, 2026
Merged

App/cross year matching 1#62
edandylytics merged 46 commits into
developmentfrom
app/cross-year-matching-1

Conversation

@edandylytics
Copy link
Copy Markdown
Collaborator

@edandylytics edandylytics commented May 14, 2026

This PR is smaller than the LOC update makes it seem! It's mostly package-lock.json updates from adding snowflake-sdk:
image

Overview

This PR contains the backend changes to enable cross year ID matching. What does that mean? Runway will now do two rounds of ID matching (for ODS-bound files, if cross-year matching is enabled):

  1. Match IDs against a roster sourced from the ODS. Any records that match on this pass will be sent to the ODS
  2. Match IDs against a cross-year roster sourced from EDU. This pass contains just IDs that were not matched in the previous pass. Any matches against the EDU roster will be processed and made available for sideloading to EDU.

Unmatched IDs after these two passes will be handled as they are today (the app makes a file of unmatched IDs available for users to resolve and re-upload).

App Updates

The app's role in cross-year ID matching is:

  • Send the Execytor a flag indicating whether a cross-year roster is available. This is set based on a partner-level setting (crossyearMatchEnabled) and the ability to connect to EDU (as determined by either the presence of an active connection for that partner or EDU credentials that could be used to establish a connection).
  • Provide an endpoint for the executor to request the cross-year roster. This endpoint queries EDU and streams results back to the Executor. We have the app make the query because it's not exposed to untrusted input files and bundle code--the Executor should be sand-boxed as much as possible

The app already supports receiving multiple output file sets per run. There are some cosmetic updates (e.g. the job view will say "all results sent to ODS" even for split jobs) but we'll take care of those in a follow on PR.

EDU Connections

How the app manages EDU connections is more elaborate than I started out with and I'm open to dialing it back. Initially, I just had the app establishing a Snowflake connection per query and then discarding it. But I noticed that the Snowflake connection can take 20+ seconds, so wanted to try pooling them (by partner) and keeping an active connection on hand. However, that ended up being more elaborate than I liked so I scaled it back to allow 0 connections in the pool... but there's still the pool for now. Runs do come in bursts (e.g. a user has a set of files to load) so there's maybe some value in connection reuse.

For current purposes, where it's just one query per job, I think using one connection per query could be fine -- the Executor doesn't mind waiting. But I do expect that managing EDU connections is something the app will need to do even after we switch to some more robust ID resolution service -- the first version of that service will likely live in Runway and need EDU connections so I think it could be worth just getting some pooling in. Still, I'm happy to pull it out if it feels like too much right now.

Follow-on work

There's some work I'm reserving for follow on PRs: frontend changes and sideloading with EDU.

Frontend updates

Not in this PR, but there will be a pane in the Admin Settings page showing whether EDU creds exist for the partner and, if so, allowing cross-year ID matching to be enabled. If we keep the approach to pooled connections, this might also allow resetting the connection (e.g. you update the creds in the AWS secrets) and want to trigger a refresh.

Additionally, we'll need to clean up some language on the Job View page that implies that if any results were sent to an ODS, all results were sent to an ODS.

What about no-ODS jobs?

With this PR, no-ODS (sideloaded) jobs will continue to use a tenant+year-specific roster file from S3. In a follow-on PR, we'll all no-ODS jobs to use EDU as a roster source if available (while preserving the roster file from S3 option if EDU is not available). This will requires some backend and frontend updates to how we check whether a no-ODS year is available. Given the mix of BE and FE changes, it felt cleanest to just pull this piece out.

edandylytics and others added 16 commits May 13, 2026 13:51
per-partner toggle for cross-year student ID matching. defaults to
false so existing partners are unchanged. temporary feature until an
ID resolution service is in place.
specifies eduCredsExist + getEduConnectionInfo. covers local env-var
fallback path for development. AWS Secrets Manager path is exercised
in deployed environments and not unit-tested here.

failing — methods not yet implemented.
getEduConnectionInfo reads from EDU_SNOWFLAKE_* env vars when present
(local dev), otherwise looks up the AWS secret edu-connection-info-<partnerId>.
returns null when nothing is available so callers can degrade gracefully —
cross-year matching is best-effort, not required for a run to proceed.

eduCredsExist is a boolean wrapper for use in payload assembly without
leaking the creds themselves.
specifies the contract between app and executor:
- crossYearMatchAvailable is always present
- when toggle on AND creds exist: flag=true, appUrls.roster set
- when either is missing: flag=false, no roster URL

failing — payload assembly doesn't read the toggle yet.
EarthbeamApiJobResponseDto gains crossYearMatchAvailable (always
present, boolean) and an optional appUrls.roster. the flag is true
iff the partner has cross-year matching enabled AND EDU creds are
available; otherwise jobs proceed without cross-year matching.

the roster URL is omitted entirely when the flag is false so the
executor can't accidentally call a disabled endpoint.
specifies the streaming roster endpoint:
- 401 for unauthenticated requests (guard)
- 409 when toggle off or creds missing
- 200 + application/x-ndjson when both prerequisites are met
- abrupt close on mid-stream errors (no in-band sentinel)

the service-level streamCrossYearRoster stub is in place so the
controller can be wired and the test suite compiles; implementation
lands in the green step.
GET /earthbeam/jobs/:runId/roster streams the partner's cross-year
roster to the executor as NDJSON. Reuses the existing executor JWT
guard; returns 409 if the toggle is off or EDU creds are missing.

The Snowflake client is lazy-imported on first use — snowflake-sdk
has slow module-init side effects we shouldn't pay on every app boot,
and cross-year fetches are a rare hot path. Connection is per-request
and closed in finally.

Mid-stream errors abort the response (no in-band sentinel); the
executor detects truncation and fails the run.

The query is a TODO placeholder until the data engineer supplies the
real one — controller wiring, auth, and error semantics are all in
place so the swap is one-line.
Move the Snowflake staging schema into the EDU connection payload, use the real cross-year roster query, and make partner-specific AWS secrets take precedence over local env fallbacks. This keeps Snowflake config in one place and restores the intended secret-first behavior for deployed environments.

Co-authored-by: Codex <codex@openai.com>
Use NODE_ENV to gate the EDU Snowflake env-var fallback so deployed environments rely on partner-specific AWS secrets. If the secret is missing or incomplete outside local development, return null instead of silently falling back.

Co-authored-by: Codex <codex@openai.com>
Add the local-development EDU Snowflake fallback variables to the API env template so cross-year roster testing can be configured without guessing the required keys.

Co-authored-by: Codex <codex@openai.com>
Clarify in the API env template that the local EDU Snowflake key values should contain base64-encoded PEM contents, with quick examples for file and clipboard encoding.

Co-authored-by: Codex <codex@openai.com>
The previous `hostname.split('.')[0]` only kept the leading subdomain,
dropping the region segment for URLs like `<account>.<region>.snowflakecomputing.com`
and causing the SDK's JWT auth to fail. Strip the `.snowflakecomputing.com`
suffix instead so the full account locator is preserved.

Co-authored-by: Cursor <cursoragent@cursor.com>
Previously we accepted a Snowflake URL and parsed an `account` out of it,
and used the schema by interpolating it into the SQL query. Switching to
discrete `account`, `database`, and `schema` values that get passed to
`snowflake.createConnection` removes the URL parsing, drops the
SQL-identifier validator (no longer interpolated), and lets the SDK
namespace the query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opening a Snowflake connection takes ~20s — paying that on every roster
request was the dominant cost. Introduce EduSnowflakePoolService that
keeps one snowflake-sdk pool per partner (min 1, max 4). On module init
we fire-and-forget warm a pool for every partner with cross-year
matching enabled, so first-request latency disappears for warm pools.
Requests for partners without a warmed pool create one on demand and
share the same in-flight promise across concurrent callers. Failed pool
creation is evicted so the next request retries.

streamCrossYearRoster now goes through eduPool.use(partnerId, cb)
instead of opening and tearing down its own connection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The env-var fallback in AppConfigService.getEduConnectionInfo is gated
on isDevEnvironment() (NODE_ENV === 'development'). Jest defaults
NODE_ENV to 'test', so the tests stopped exercising the env-var path
once the gate was added — eduCredsExist returned false and the payload
omitted the roster URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
the previous write loop discarded response.write's return value, which
meant a slow executor would cause node to buffer the entire roster in
memory. switch to stream.pipeline so node manages backpressure end to
end; pipeline also destroys downstream streams on error, preserving the
abrupt-close behavior on mid-stream snowflake failures.
getEduConnectionInfo previously collapsed every secrets-manager error
(IAM, throttling, network, malformed JSON, missing) to null, which
collapsed to a 409 from the roster endpoint and crossYearMatchAvailable
false in the job payload. operationally indistinguishable from "feature
off."

now it returns null only for ResourceNotFoundException and logs +
rethrows everything else. eduCredsExist (used at payload assembly time)
still swallows so unrelated run creation degrades gracefully. the roster
endpoint's prereq check switches to calling getEduConnectionInfo
directly so a real aws outage produces a 5xx, not 409.
the two streaming tests used to mock streamCrossYearRoster itself, so
they never exercised the pipeline, transform, sql, or binds. they now
spy on EduSnowflakePoolService.use and supply a fake connection whose
execute() returns a Readable, letting the real service body run.

happy path asserts on the captured execute args (sqlText targets the
EDU staging tables, binds is [tenantCode], streamResult is true) plus
the NDJSON body. mid-stream-error asserts the response truncated after
exactly one row with no in-band sentinel, replacing a test that
asserted nothing.
the catch handler unconditionally deleted the partner key, which could
clobber a newly-inserted (in-flight) pool entry under the same key.
only delete if the map still points at the failed entry.
snowflake key-pair auth uses only the private key; the public half is
uploaded to snowflake out-of-band. carrying publicKey through the
config layer added nothing and the "missing publicKey → null" gate
would have silently disabled the feature for an otherwise-valid secret.

removed from the EduConnectionInfo type, both secret and env-var
fallbacks, the validation check, .env.copyme, and the test fixtures.
the executor's conditional roster-fetch step was missing from the
Executor Lifecycle section. also add the DTO file to the App↔Executor
key files. project policy in AGENTS.md itself requires doc updates
alongside behavior changes.
generic-pool defaults acquireTimeoutMillis to forever, so under
saturation (5+ concurrent runs for the same partner) requests could
hang indefinitely. 60s is generous over the ~20s JWT connect observed
locally and gives a clear timeout error instead of an unbounded wait.
without onModuleDestroy, open snowflake sockets outlived the process on
SIGTERM (beanstalk/ECS deploys), risking hung shutdowns past the
platform grace period.

drains each pool with a 10s race timeout under Promise.allSettled so
one stuck pool can't block the others.
the abrupt-close decision was for mid-stream failures (after rows have
been written). when pool acquisition or connection.execute fails before
the first row, no bytes are flushed yet — tearing the TCP connection
just makes it harder for the executor to diagnose.

the controller now checks res.headersSent and throws
InternalServerErrorException when nothing has been sent, falling back
to res.destroy only after headers are out. test covers the pre-stream
case.
generic-pool's pool.use() releases the resource on both resolve and
reject, so a connection whose row stream errored mid-flight gets handed
to the next acquirer as healthy. switch to explicit acquire + release
on success / destroy on error so the suspect connection leaves the pool.
the private async getAWSSecret signature lost its 2-space class-member
indent — re-indent to match neighbors.
the existing entry only mentioned cross-year as a side note on the
ODS roster fetch step, which implied a single transform pass that
includes the cross-year roster up front. the actual flow is a second
earthmover pass triggered only when the first pass leaves unmatched
students and the partner has the feature enabled.
eduCredsExist was a thin wrapper that called getEduConnectionInfo, did
!== null, and swallowed errors as "false". the swallow-on-error decision
is specific to job-payload assembly (don't break run creation if Secrets
Manager hiccups) — moving it to that one caller keeps app-config focused
on fetching creds rather than also owning what to do when fetching fails.

drops the wrapper and its two thin tests; the real fetch is still
covered by the getEduConnectionInfo tests and by the cross-year payload
tests in earthbeam-api.spec.ts.
inline the env-var fallback path (matching how postgresPoolConfig and
encryptionKey already handle their AWS/env split) and drop the
EduConnectionInfo named type — callers infer the shape. brings the EDU
method into line with the rest of AppConfigService.
@edandylytics edandylytics force-pushed the app/cross-year-matching-1 branch from 2b9c110 to b02fa5d Compare May 20, 2026 16:56
motivation:
- the central secrets cache in AppConfigService meant cred rotation
  required a process restart on top of the pool refresh PR 2 already
  plans.
- "do creds exist?" was answered by AppConfigService, which had no
  natural way to bypass its own cache for the EDU path.
- payload assembly and the roster endpoint were both asking AppConfig
  about EDU, even though the pool service is the thing that actually
  uses the creds.

changes:
- AppConfigService.getAWSSecret now wraps a new uncached
  fetchAWSSecret; postgres/encryption/jwt callers keep cached
  semantics, EDU calls fetchAWSSecret directly. one cache for one
  concern.
- EduSnowflakePoolService gains a public canConnect(partnerId) that
  answers from a resolvedPools set when a pool is already up (no AWS
  call needed) or falls back to a fresh getEduConnectionInfo lookup.
  failures are swallowed + logged so payload assembly can degrade.
- earthbeamInputForRun and getCrossYearRosterContext both ask the pool
  service instead of AppConfig.

the 5xx-vs-409 distinction at the roster endpoint becomes 409 for any
"can't connect" reason; real failures during streaming still propagate
through the controller's headersSent-aware catch (500 pre-stream,
abrupt close mid-stream).
getCrossYearRosterContext was calling eduPool.canConnect to decide
between 409 and proceeding to stream. with the secrets cache gone for
EDU, that meant a cold roster request did two uncached AWS lookups —
one for the prereq check, one for createPool inside the stream.

drop the check. the toggle check still answers "feature off for this
partner" (409). everything else flows through the stream attempt; pool
creation failures hit the controller's headersSent-aware catch and
turn into a clean 500. the existing "missing EDU creds" test moves
from 409 to 500 (and still exercises the real pool-creation failure
path, unlike the mocked-pool 500 test).
if (cachedValue) {
return cachedValue;
}
const secretValue = await this.fetchAWSSecret(secretName);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like how AppConfigService caches AWS secrets. For the EDU creds secret, I want to make it easy to grab fresh values in the secret without restarting the app and caching would prevent that.

For now, I just decided to factor fetching the secret out of caching since there are other callers of getAWSSecret and we haven't had any actual issues with caching. Long term, I think it makes sense to remove caching entirely from AppConfigService. Callers should instead manage their own caches if they want to have a cache at all.

edandylytics and others added 5 commits May 22, 2026 10:52
So multiple environments (dev/stg/prod) sharing one AWS account can
keep distinct EDU connection secrets per partner. Throws if ENVLABEL
is unset, matching the existing ECS-config helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Method had one caller and most of its LOC were SUCCESS/ERROR object
construction. The controller now does the run lookup + toggle check
directly. partnerId/tenantCode are read from run.job scalars for
symmetry between the two args.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single caller, just a query + response. Controller now injects
EduSnowflakePoolService directly and owns the full Snowflake-to-NDJSON
pipeline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Snowflake server-side session timeouts would silently kill the min:1
warm connection after hours of idle, causing the next request to fail
on a stale handle. Switched to min:0 with eviction every 60s on a 60s
idle threshold, per Snowflake's docs.

With min:0 the startup warm-up creates pool objects but no actual
sockets, so it became dead weight — removed warmPools, OnModuleInit,
and the now-unused PRISMA_READ_ONLY dep. First request per partner
unconditionally pays the ~20s JWT connect cost; fine for a temporary
feature on low-traffic partners.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop resolvedPools — pools.has gives the same fast-path for canConnect
  with a benign in-flight-failure race.
- Early-return in getOrCreatePool; remove failedEntry rename now that the
  catch handler closes over a single entry.
- Drop the single-field PoolEntry wrapper; map and methods carry
  Pool<Connection> directly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@edandylytics edandylytics marked this pull request as ready for review May 22, 2026 18:05
@edandylytics edandylytics requested a review from bhaugeea May 22, 2026 18:05

// Lazy import: snowflake-sdk has slow module-init side effects we don't
// want to pay on every app boot.
const snowflake = await import('snowflake-sdk');
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note -- I'm working through whether this lazy import is actually necessary. I'm suspicious

edandylytics and others added 6 commits May 22, 2026 13:39
The cross-year payload and roster integration tests drove getEduConnectionInfo
via process.env (NODE_ENV=development + EDU_SNOWFLAKE_*), with a duplicated
save/restore harness in each describe block. Switched to a jest spy on
AppConfigService.getEduConnectionInfo — one beforeEach per block, default
mockResolvedValue(null), per-test override where creds need to be present.

Also dropped edu-config.spec.ts entirely: both tests only exercised the
local-dev env-var fallback (never the AWS Secrets Manager path that runs
in deployed environments), and the same dev-fallback shape is now
implicitly covered through the AppConfigService spy seam.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Move the describe to sit after the top-level its and before the
  Descriptor Mappings describe, so the send-to-ODS / no-ODS pair is no
  longer split.
- Invert the default test state: beforeEach now sets cross-year fully
  enabled (toggle on + creds mocked present). The happy path needs no
  overrides; each negative test removes exactly one condition, so a
  broken gate can't silently make the test pass for the wrong reason.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both the happy-path and mid-stream-error roster tests now use a
streaming parser via .buffer(true).parse(streamParser). The parser
collects chunks as they arrive and signals whether 'end' fired
(complete) or 'close'/'error' fired first (truncated).

The mid-stream test was previously written as "either we got a
partial body or supertest threw, both are fine" because the abort
behavior varied across Node/supertest combos. The streaming parser
makes it deterministic — the test asserts complete=false and the
exact captured bytes.

Also dropped the test-local poolUseSpy declaration and afterEach
cleanup; mocks created inside an `it` are now mockRestored at the
bottom of the same `it`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Extract a mockEduPoolStream helper that stubs the pool's use() with a
  fake connection streaming from an Iterable/AsyncIterable. Each test's
  distinguishing setup is now a single line (the source stream).
- Drop the binds assertion + rename the happy-path test — the bind value
  is literally visible in the controller, no derived behavior to verify.
- Group the two stream-consuming tests into a `streaming responses`
  describe so the helpers (streamParser, mockEduPoolStream) live next
  to the tests that use them. The toggle-on prereq moves into the
  nested beforeEach.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Mid-stream test: move the comment to point at the complete:false
  assertion as the truncation indicator (set by streamParser when
  'close'/'error' fires before 'end'). Drop the body shape sanity
  check — the exact-bytes assertion already covers it.
- Pool-acquisition-fails test: drop the content-type check. If we
  received a 500 status cleanly, supertest got a proper response; a
  torn socket would surface as a client-side error instead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
snowflake-sdk can return BigInt for high-precision NUMBER columns, and
JSON.stringify throws on BigInt. A sync throw inside _transform escapes
pipeline and crashes the process; routing the failure through cb(err)
lets pipeline destroy the response cleanly, which the executor surfaces
as the same abrupt-close it already handles.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@bhaugeea bhaugeea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good: streaming is cool and the connection management seems fully baked. Some high-level comments before I look at the tests, try to run the code, and look at all the details more closely (and also before I log off until Wednesday).

Comment thread AGENTS.md Outdated
9. **Output files**: POST output file path + `sentToOds` flag to `/output-files` callback; app validates path, lists S3, saves `run_output_file_set`
10. **Done**: POST status `{action: DONE, status: success|failure}`
6. **Transform**: `earthmover run` against the ODS roster (with encoding detection + retry)
7. **Cross-year retry** (when `crossYearMatchAvailable` and the first pass produced unmatched students): GET `appUrls.roster` for the cross-year NDJSON roster, write to a `.jsonl` file, and re-run `earthmover` against it using the same ID type. Recovers students whose identifiers changed between school years.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case in which a row is "supposed to" get sent to the ODS, but doesn't match there and is instead sent straight to EDU? Or is there reliable alignment between this ODS/EDU fallback flowchart and the current/past year intended loading destination flowchart?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have looked at this note carefully :/ The piece about identifiers changing between years is not a thing, so striking that.

But to your questions... the main thing about how the Executor coordinates ODS roster matches and EDU roster matches is that it progresses through each stage of processing for both before moving onto the next. That is, it runs Earthmover, which both performs the ID matching and transforms input files, with the ODS roster and then, if the ODS run succeeded, the EDU roster before sending results anywhere. Then, if that's successful, the Executor sends ODS-matched results to the ODS. If that succeeds, the Executor returns output files sets (for ODS and EDU matched results) to the API.

This should prevent most issues of the kind (I think) you're asking about. For example, it rules out having a file generate an "insufficient matches" error against the ODS roster (say 49% matched) but then succeeding against the EDU roster, when 49% of those matches (and probably more) really should be going to the ODS.

One issue that we don't prevent is if the file contains (a) an incorrect ID that (b) matches the ID for a different student. This possibility already exists with matching solely against the ODS, but including IDs from EDU in theory increases the probability of (b). That's a limitation of the current ID matching strategy more than how rosters are sourced. Moving to ID resolution that takes other observables (name, grade, DOB, etc) into account will be necessary to address it.

Comment on lines +89 to +91
// Evict failed creations so subsequent requests can retry. Guard against
// racing with a concurrent insertion under the same key — only delete if
// the map still points at this entry.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for there to be a concurrent insertion? All the code from L84 through L88 is synchronous isn't it? L92 is not, but if many callers called before the first had rejected, only the first would make it past L85.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! Good catch. This is left over from a previous, more involved implementation, I'll remove it.

Comment on lines +6 to +8
* Maintains one Snowflake connection pool per partner, created on demand.
* The first request for a partner pays the ~20s JWT connect cost; subsequent
* requests reuse the pool until the evictor reaps idle connections.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: your comment about looking for my perspective on this general strategy: maybe I'm not thinking hard enough about it, but I don't have any huge ideas about it and this seems totally fine. The other options you explored were keeping connections warm even when they're not being used, and not keeping a long-lived pool at all?

Seems like compared to creating connections fresh every time they're needed, the extra code here is basically onModuleDestroy() and getOrCreatePool()`, and they don't seem too bad — and you're clear that usage is indeed somewhat bursty, so there's a decent chance this code will help users avoid the 20s startup delay sometimes. I wouldn't mind a little reasoning about the 60s lifetime though. I mean that's pretty quick. If it takes someone 3 minutes to go get the other assessment file they're trying to load, it's another cold start. Is the server-side timeout that quick, giving us no other option (or, am I misunderstanding something)?

The main two things that come to mind in general are the cold-start delay, which you're clear about here, and how this all does or doesn't relate to Snowflake's serverless cost structure, which I wouldn't mind some kind of mention of.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a writeup here to think through the questions you're bringing up. The 60 sec idle timeout is definitely not correct. I changed to 30 min, which I think should handle the burst-iness while allowing the pool to shrink back down in a reasonable amount of time.

The TLDR on the writeup is that I think 30 mind idle timeout, a 60 sec connection acquisition timeout, and a large setting for max connections is the way to go. That allows the app to reuse connections, but also allows the Snowflake concurrent query limits to be the throttle -- trying to stay under that via connection pool limits seems unnecessary (Snowflake manages a query queue, doesn't seem to care about concurrent connections) -- we really want the connection pool to enable connection reuse while letting connections get evicted, not so much for capping the number of connections like we would with postgres.

But really, I think we just need to get something reasonable in dev and try it out under some load. My hunch is that initiating ECS tasks and the first phases of job processing introduce enough lag of varying amounts that we wouldn't actually have a ton of API jobs hitting /roster all at once, even if they're initiated all at once.

- bump EDU snowflake pool idleTimeoutMillis to 30 minutes so bursty
  partner usage within a half hour reuses warm connections; still well
  under Snowflake's 4h default programmatic-session timeout
- simplify the failed-creation eviction in getOrCreatePool: there's no
  code path that races a concurrent insertion under the same key, so a
  plain delete suffices
- add a Cross-Year Matching Flow subsection to AGENTS.md with a mermaid
  diagram describing the executor's staged ordering (both transforms,
  then ODS load, then API exposure) and clarifying that
  cross-year-matched rows reach external consumers via the Runway API,
  not the ODS
- drop the "identifiers changing between school years" framing from the
  cross-year retry step; that's not the actual use case

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@bhaugeea bhaugeea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, runs well, and tests well!

- bump the per-partner EDU snowflake pool max from 4 to 20 to give
  bursty multi-tenant API workloads more concurrent connections before
  callers queue on acquire
- add optional warehouse and role to EDU connection info (both the
  local-dev env path and the Secrets Manager path); when unset they're
  omitted from the connection options so Snowflake applies the user's
  defaults, letting the app override when needed
- document EDU_SNOWFLAKE_WAREHOUSE / EDU_SNOWFLAKE_ROLE in .env.copyme

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@edandylytics edandylytics requested a review from bhaugeea May 27, 2026 22:22
@edandylytics edandylytics merged commit edf8ae0 into development May 28, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants