[Ops]: Optimize DB-backed ingestion state queries for large performance archives #205

@tomvothecoder

Summary

Follow up on the DB-backed ingestion state work from #189 / PR #204.

Today the ingestor fetches the full /api/v1/ingestions/state payload for one machine, then compares that full machine-level state against the current archive scan locally. This is correct, but it may become inefficient as the database accumulates many more execution IDs over time.

Motivation

A concern raised in the comments on issue #189 was that SimBoard may eventually store tens of thousands of historical execution IDs for a machine, while a given performance_archive scan may contain only a small number of current execution directories. In that case, the entire machine-level state payload is far more data than the ingestor actually needs.

Example:

  • DB has 20,000 stored execution IDs for a machine.
  • Current archive scan finds 20 execution IDs.
  • Current implementation still fetches the full machine-level state response before comparing locally.

This is not a correctness bug and should not block the existing PR, but it is a valid scalability concern.

Current behavior

  • Ingestor scans execution directories.
  • Ingestor fetches /api/v1/ingestions/state?machine_name=....
  • API returns all known case-path state for that machine.
  • Ingestor compares locally and submits only changed cases.
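
The local comparison step above can be sketched as follows. This is only an illustration, not the ingestor's actual code: the payload shape (a mapping from case path to a fingerprint string), the function name `select_changed_cases`, and the sample paths are all assumptions made for the example. It mirrors the 20,000 vs. 20 scenario from the Motivation section to show how little of the fetched state is actually consulted.

```python
from typing import Dict, List


def select_changed_cases(scanned: Dict[str, str], stored: Dict[str, str]) -> List[str]:
    """Return scanned case paths that are new or whose fingerprint differs.

    Both arguments map case path -> fingerprint (hypothetical payload shape).
    """
    return sorted(
        path
        for path, fingerprint in scanned.items()
        if stored.get(path) != fingerprint
    )


# Hypothetical data: 20,000 stored entries for the machine.
stored = {f"/archive/case_{i:05d}": "v1" for i in range(20_000)}

# The current scan sees 19 of those entries plus one brand-new case (20 total).
scanned = {f"/archive/case_{i:05d}": "v1" for i in range(19_981, 20_000)}
scanned["/archive/case_19999"] = "v2"  # fingerprint changed since last ingest
scanned["/archive/case_20000"] = "v1"  # not yet known to the database

changed = select_changed_cases(scanned, stored)
consulted = len(scanned.keys() & stored.keys())

print(changed)
print(f"{consulted} of {len(stored)} stored entries were relevant to this scan")
```

Only the entries whose case paths appear in the scan influence the result; the rest of the payload is transferred and then ignored.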

Goal

Reduce the amount of state returned to ingestors when the DB-backed state for a machine is much larger than the currently scanned archive contents.

Possible approaches

  • Add optional query filters to /api/v1/ingestions/state, such as:
    • since_created_at
    • since_execution_date or similar execution-derived lower bound
    • case-path subset filtering if the ingestor already knows the candidate case paths
  • Allow the ingestor to scan first, derive a lower bound or candidate subset, then request only relevant state.
  • Measure response-size and runtime impact before and after filtering.
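
A minimal sketch of how an updated ingestor might build a filtered request, assuming the filters are exposed as plain query parameters. `since_created_at` is the name proposed above; the repeated `case_path` parameter, the `build_state_url` helper, and the host and machine name are hypothetical and not part of the current API.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def build_state_url(
    base_url: str,
    machine_name: str,
    since_created_at: Optional[datetime] = None,
    case_paths: Optional[Iterable[str]] = None,
) -> str:
    """Build the state URL, adding optional filters only when they are given."""
    params = [("machine_name", machine_name)]
    if since_created_at is not None:
        params.append(("since_created_at", since_created_at.isoformat()))
    # Hypothetical: one repeated `case_path` parameter per candidate path.
    for path in case_paths or ():
        params.append(("case_path", path))
    return f"{base_url}/api/v1/ingestions/state?{urlencode(params)}"


# With no filters, the URL matches what existing ingestors already send.
unfiltered = build_state_url("https://simboard.example", "example-machine")

# Scan first, then request only the state relevant to what was found.
filtered = build_state_url(
    "https://simboard.example",
    "example-machine",
    since_created_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
    case_paths=["/archive/case_a", "/archive/case_b"],
)
query = parse_qs(urlsplit(filtered).query)
print(unfiltered)
print(query)
```

A long candidate list could exceed practical URL length limits, so case-path subset filtering may be better served by a request body; that tradeoff is one of the things the measurement step would need to inform.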

Acceptance criteria

  • Ingestion-state payload size can be bounded or reduced for large historical machine datasets.
  • API behavior remains backwards compatible for existing ingestors unless/until the ingestor is updated to use optional filters.
  • The tradeoffs and the chosen filter strategy are documented.
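
One way to check the first two criteria together is sketched below, under stated assumptions: the record fields `case_path` and `created_at`, the helpers `filter_state` and `payload_bytes`, and the synthetic data are all hypothetical and do not reflect the real schema. The filter returns everything when no optional arguments are supplied, which is the backwards-compatible path, and the serialized size gives a rough before/after measure.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


def filter_state(
    records: List[dict],
    since_created_at: Optional[datetime] = None,
    case_paths: Optional[Iterable[str]] = None,
) -> List[dict]:
    """Apply optional filters; with none given, return every record unchanged."""
    wanted = set(case_paths) if case_paths is not None else None
    return [
        r
        for r in records
        if (since_created_at is None or r["created_at"] >= since_created_at)
        and (wanted is None or r["case_path"] in wanted)
    ]


def payload_bytes(records: List[dict]) -> int:
    """Approximate response size as the UTF-8 length of the JSON encoding."""
    return len(json.dumps(records, default=str).encode("utf-8"))


# Synthetic state: 1,000 records, one created per day (hypothetical fields).
base = datetime(2020, 1, 1, tzinfo=timezone.utc)
records = [
    {"case_path": f"/archive/case_{i:04d}", "created_at": base + timedelta(days=i)}
    for i in range(1000)
]

everything = filter_state(records)  # existing ingestors: no filters
recent = filter_state(records, since_created_at=base + timedelta(days=990))

print(len(everything), payload_bytes(everything))
print(len(recent), payload_bytes(recent))
```

Real measurements would need to be taken against the actual endpoint and database, since query time and serialization cost are not captured by this in-memory comparison.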

Notes

  • This issue is about efficiency/scaling, not correctness.
  • The current DB-backed state approach was validated on NERSC and should remain the baseline source of truth.
  • Multi-machine shared archive handling should remain separate unless we discover overlap in solution design.

Labels

type: ops (Operation and Deployment tasks for DOE sites.)
