[Enhancement]: Add state-first HPC upload ingestion flow with DB-backed dedupe parity #207

@tomvothecoder

Description

Is your feature request related to a problem?

For non-NERSC sites that must upload archives over HTTPS, SimBoard cannot currently follow the same state-first dedupe flow used by the NERSC path ingestor.

Current gaps:

  • /api/v1/ingestions/state reconstructs state only from HPC_PATH ingestions.
  • /api/v1/ingestions/from-upload does not accept caller-provided processed_execution_ids or stable case identity metadata.
  • Upload ingestions are currently recorded as BROWSER_UPLOAD, not a dedicated HPC automation source.
  • There is no upload-side automation runner that fetches DB-backed state before deciding which cases to submit.

Result: upload-based HPC automation cannot skip unchanged cases before ingestion, and cannot contribute back to the same DB-backed dedupe state model.

Describe the solution you'd like

Implement a dedicated HPC upload ingestion flow that mirrors NERSC path-ingestion behavior, using one case per upload request.

Proposed scope:

  • Add a dedicated HPC upload ingestion contract or endpoint.
  • Require one case archive per request plus stable case identity metadata.
  • Accept processed_execution_ids from the upload-side automation job.
  • Persist those ingestions as HPC_UPLOAD.
  • Extend /api/v1/ingestions/state so it includes both HPC_PATH and HPC_UPLOAD rows.
  • Add an upload-side automation runner that:
    • fetches /api/v1/ingestions/state
    • scans local archive contents
    • skips unchanged cases
    • uploads only changed cases
  • Document the distinction between browser/manual upload and automated HPC upload.
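
The runner's skip-or-upload decision could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `LocalCase`, `plan_uploads`, the `case_key` field, and the shape of the parsed state are all hypothetical, and it assumes a case counts as "unchanged" when every execution ID found locally is already present in the DB-backed state.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class LocalCase:
    """One case discovered while scanning a local archive (hypothetical shape)."""

    case_key: str  # stable case identity, e.g. the case path (assumed)
    execution_ids: FrozenSet[str]  # execution IDs found on disk for this case


def plan_uploads(
    known_state: Dict[str, Set[str]],
    local_cases: List[LocalCase],
) -> List[Tuple[LocalCase, List[str]]]:
    """Pair each changed case with the execution IDs the server has not seen.

    Assumption: a case is "unchanged" when every local execution ID already
    appears in the DB-backed state for that case.
    """
    plan = []
    for case in local_cases:
        seen = known_state.get(case.case_key, set())
        new_ids = sorted(case.execution_ids - seen)
        if new_ids:  # unchanged cases are skipped entirely
            plan.append((case, new_ids))
    return plan


# Hypothetical state, as it might look after parsing /api/v1/ingestions/state.
known = {"/archive/case_a": {"100", "101"}}
local = [
    LocalCase("/archive/case_a", frozenset({"100", "101"})),  # unchanged
    LocalCase("/archive/case_b", frozenset({"200"})),  # never ingested
]
for case, new_ids in plan_uploads(known, local):
    print(case.case_key, new_ids)
```

In this example only `/archive/case_b` is planned for upload. Each planned entry maps to one upload request, which keeps the one-case-per-request rule and lets the runner send the per-case execution IDs alongside the archive.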

Acceptance criteria:

  • A service-account-driven upload workflow can fetch known execution state before upload.
  • HPC upload requests can persist per-case processed_execution_ids needed for future dedupe decisions.
  • /api/v1/ingestions/state reconstructs state from both HPC_PATH and HPC_UPLOAD records.
  • Unchanged cases are skipped by the upload automation runner.
  • Browser/manual upload behavior remains supported and clearly separated from HPC automation behavior.
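
As an illustration of the third criterion, state reconstruction across both source types might look something like the sketch below. The row shape is an assumption: `source_type` and `case_key` are hypothetical field names, and only `processed_execution_ids` and the source values `HPC_PATH`, `HPC_UPLOAD`, and `BROWSER_UPLOAD` come from this issue.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

# Source types that contribute to dedupe state under this proposal.
STATE_SOURCES = frozenset({"HPC_PATH", "HPC_UPLOAD"})


def reconstruct_state(rows: Iterable[Mapping]) -> Dict[str, List[str]]:
    """Merge processed execution IDs per case across HPC ingestion rows.

    Rows from other sources (e.g. BROWSER_UPLOAD) are ignored, so manual
    uploads stay separate from the automation state.
    """
    merged = defaultdict(set)
    for row in rows:
        if row["source_type"] not in STATE_SOURCES:
            continue
        merged[row["case_key"]].update(row["processed_execution_ids"])
    return {key: sorted(ids) for key, ids in merged.items()}


# Hypothetical audit rows for demonstration only.
rows = [
    {"source_type": "HPC_PATH", "case_key": "/archive/case_a",
     "processed_execution_ids": ["100"]},
    {"source_type": "HPC_UPLOAD", "case_key": "/archive/case_a",
     "processed_execution_ids": ["101"]},
    {"source_type": "BROWSER_UPLOAD", "case_key": "/archive/case_z",
     "processed_execution_ids": ["900"]},
]
print(reconstruct_state(rows))
```

Here the path-ingested and upload-ingested rows for the same case collapse into one entry, while the browser upload contributes nothing to the state.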

Describe alternatives you've considered

  1. Reuse current browser upload flow.
    This is insufficient because the current upload contract carries no state metadata and persists uploads as BROWSER_UPLOAD.

  2. Support multi-case HPC upload requests.
    This would require a more complex persistence model because current ingestion audit rows only store one source_reference and one flat processed_execution_ids list, which is not sufficient for reconstructing per-case state from a multi-case upload.

  3. Keep upload flow as-is and rely on duplicate detection during ingest.
    This preserves correctness at the execution_id level, but loses the main benefit of state-first dedupe: skipping unchanged cases before upload and ingestion.
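
The limitation described under alternative 2 can be shown with a small sketch. It assumes an audit row holds exactly the two fields named above (`source_reference` and a flat `processed_execution_ids` list); `to_audit_row`, the case names, and the archive name are hypothetical.

```python
from typing import Dict, List


def to_audit_row(source_reference: str, cases: Dict[str, List[str]]) -> dict:
    """Flatten a multi-case upload into a single audit row (assumed shape)."""
    flat = sorted(eid for ids in cases.values() for eid in ids)
    return {
        "source_reference": source_reference,
        "processed_execution_ids": flat,
    }


# Two different per-case groupings of the same execution IDs.
upload_a = {"case_x": ["e1", "e2"], "case_y": ["e3"]}
upload_b = {"case_x": ["e1"], "case_y": ["e2", "e3"]}

row_a = to_audit_row("archive.tar.gz", upload_a)
row_b = to_audit_row("archive.tar.gz", upload_b)

# Both uploads produce identical rows, so the per-case mapping is lost.
print(row_a == row_b)
```

Because the two rows are indistinguishable, per-case state cannot be rebuilt from them, which is why a multi-case contract would need a richer persistence model.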

Additional context

Recommendation: implement this in a separate PR from docs-only clarification work.

Reasons for preferring one case per HPC upload request:

  • Matches the current state model, which is keyed by case path.
  • Minimizes schema and audit-model complexity.
  • Keeps browser/manual multi-case uploads separate from HPC automation semantics.
  • Offers the lowest-risk path to parity with the existing NERSC runner.
