Summary
Follow up on the DB-backed ingestion state work from #189 / PR #204.
Today the ingestor fetches the full /api/v1/ingestions/state payload for one machine, then compares that full machine-level state against the current archive scan locally. This is correct, but it may become inefficient as the database accumulates many more execution IDs over time.
Motivation
A concern raised in issue #189 comments was that SimBoard may eventually store tens of thousands of historical execution IDs for a machine, while a given performance_archive scan may only contain a small number of current execution directories. In that case, returning the entire machine-level state payload is more data than the ingestor actually needs.
Example:
- DB has 20,000 stored execution IDs for a machine.
- Current archive scan finds 20 execution IDs.
- Current implementation still fetches the full machine-level state response before comparing locally.
This is not a correctness bug and should not block the existing PR, but it is a valid scalability concern.
Current behavior
- Ingestor scans execution directories.
- Ingestor fetches /api/v1/ingestions/state?machine_name=....
- API returns all known case-path state for that machine.
- Ingestor compares locally and submits only changed cases.
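The local comparison step above can be sketched as follows. This is a minimal illustration, not the ingestor's actual code: the function name `select_changed_cases`, the case paths, and the idea that state is a mapping from case path to some fingerprint string are all assumptions made for the example.

```python
from typing import Dict, List


def select_changed_cases(scanned: Dict[str, str], known_state: Dict[str, str]) -> List[str]:
    """Return case paths that are new or whose fingerprint differs from stored state.

    Both arguments map a case path to a fingerprint (hypothetical representation).
    """
    return sorted(
        path
        for path, fingerprint in scanned.items()
        # A missing path yields None, which never equals a fingerprint string,
        # so unseen cases are treated as changed.
        if known_state.get(path) != fingerprint
    )


# The stored state may hold many entries the current scan never touches.
scanned = {"caseA/123": "f1", "caseB/456": "f2"}
known_state = {"caseA/123": "f1", "caseB/456": "old", "caseC/001": "f9"}
changed = select_changed_cases(scanned, known_state)
```

Only keys present in `scanned` are ever looked up, which is why every stored entry outside the current scan is transferred without being used.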
Goal
Reduce the amount of state returned to ingestors when the DB-backed state for a machine is much larger than the currently scanned archive contents.
Possible approaches
- Add optional query filters to /api/v1/ingestions/state, such as:
  - since_created_at
  - since_execution_date or similar execution-derived lower bound
  - case-path subset filtering if the ingestor already knows the candidate case paths
- Allow the ingestor to scan first, derive a lower bound or candidate subset, then request only relevant state.
- Measure response-size and runtime impact before and after filtering.
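One way the scan-first approach could look on the ingestor side is sketched below. The filters do not exist yet, so everything here is hypothetical: the parameter names `since_execution_date` and `case_path` follow the proposals above, the helper names are invented for illustration, and the sketch assumes each scanned execution carries a date from which a lower bound can be derived.

```python
from datetime import date
from typing import Iterable, List, Optional, Tuple
from urllib.parse import urlencode

STATE_ENDPOINT = "/api/v1/ingestions/state"


def derive_lower_bound(execution_dates: Iterable[date]) -> Optional[date]:
    """Return the earliest execution date in the current scan, or None if empty."""
    dates = list(execution_dates)
    return min(dates) if dates else None


def build_state_query(
    machine_name: str,
    since_execution_date: Optional[date] = None,
    case_paths: Optional[Iterable[str]] = None,
) -> str:
    """Build the state request path, adding the proposed filters only when given."""
    params: List[Tuple[str, str]] = [("machine_name", machine_name)]
    if since_execution_date is not None:
        # Proposed filter, not part of the current API.
        params.append(("since_execution_date", since_execution_date.isoformat()))
    for path in case_paths or []:
        # Proposed repeated parameter for case-path subset filtering.
        params.append(("case_path", path))
    return f"{STATE_ENDPOINT}?{urlencode(params)}"


lower_bound = derive_lower_bound([date(2024, 5, 3), date(2024, 5, 1)])
unfiltered = build_state_query("example-machine")
filtered = build_state_query("example-machine", since_execution_date=lower_bound)
```

With no filters supplied, the request is the same as today's, so an ingestor that has not been updated keeps working unchanged. For large scans, a repeated `case_path` parameter could run into URL length limits, which is one of the tradeoffs to weigh against a date-based lower bound.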
Acceptance criteria
- We can bound or reduce ingestion-state payload size for large historical machine datasets.
- API behavior remains backwards compatible for existing ingestors unless/until the ingestor is updated to use optional filters.
- Document the tradeoffs and chosen filter strategy.
Notes
- This issue is about efficiency/scaling, not correctness.
- The current DB-backed state approach was validated on NERSC and should remain the baseline source of truth.
- Multi-machine shared archive handling should remain separate unless we discover overlap in solution design.