[REFACTOR] Refactor dataset conversion pipeline to support runtime-selectable data schemas #217

Description

@jeipollack

Motivation

During integration testing of the SHE inference branch, it became clear that wf-psf currently assumes a fixed dataset structure across training, evaluation, and inference workflows.

This assumption is too restrictive for downstream integration scenarios where different processing stages may provide different subsets of fields at runtime.

For example:

  • Inference workflows may only provide:

    • positions
    • seds
  • Training / evaluation workflows may additionally provide:

    • sources
    • masks

In upcoming WaveDiff releases, the inference module will also be reused for evaluation workflows, which requires the pipeline to support multiple dataset “contracts” depending on runtime context.

Proposed Changes

Introduce a schema-driven dataset conversion and validation system allowing dataset requirements to be selected dynamically at runtime.

Key additions include:

  • Dataset schema registry for runtime validation

  • Support for multiple processing modes:

    • TRAIN
    • EVALUATION
    • INFERENCE
  • Separation of:

    • dataset schema definitions
    • field conversion handlers
    • conversion context objects
  • Runtime selection of schema mode through configuration

  • Relaxation of hard-coded assumptions about required dataset fields

  • Structured logging for conversion and validation operations
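As a rough illustration of how these pieces could fit together, here is a minimal sketch using only the standard library. Every name in it (`SchemaMode`, `DatasetSchema`, `SCHEMA_REGISTRY`, `ConversionContext`, `FIELD_HANDLERS`, `convert_dataset`) is hypothetical and not taken from the wf-psf code base. The required fields for TRAIN are an assumption, and the handlers are placeholders that only copy values into lists.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("dataset_conversion_sketch")


class SchemaMode(Enum):
    TRAIN = "TRAIN"
    EVALUATION = "EVALUATION"
    INFERENCE = "INFERENCE"


@dataclass(frozen=True)
class DatasetSchema:
    """Schema definition: which fields a mode requires or tolerates."""
    required: frozenset
    optional: frozenset = frozenset()


# Hypothetical registry. INFERENCE and EVALUATION follow the issue text;
# the TRAIN entry is an assumption made for illustration only.
SCHEMA_REGISTRY = {
    SchemaMode.INFERENCE: DatasetSchema(frozenset({"positions", "seds"})),
    SchemaMode.EVALUATION: DatasetSchema(
        frozenset({"positions", "seds", "sources"}), frozenset({"masks"})
    ),
    SchemaMode.TRAIN: DatasetSchema(
        frozenset({"positions", "seds", "sources"}), frozenset({"masks"})
    ),
}


@dataclass(frozen=True)
class ConversionContext:
    """Conversion context: runtime information handed to every handler."""
    mode: SchemaMode


# Field conversion handlers, kept separate from the schema definitions.
# Placeholder behaviour: copy each value into a plain list.
FIELD_HANDLERS = {
    name: (lambda value, ctx: list(value))
    for name in ("positions", "seds", "sources", "masks")
}


def convert_dataset(dataset, ctx):
    """Validate `dataset` against the schema for ctx.mode, then convert it."""
    schema = SCHEMA_REGISTRY[ctx.mode]
    missing = sorted(schema.required - set(dataset))
    if missing:
        raise KeyError(f"{ctx.mode.name} dataset is missing fields: {missing}")

    # Convert only fields the schema knows about; ignore anything else.
    allowed = schema.required | schema.optional
    converted = {
        name: FIELD_HANDLERS[name](dataset[name], ctx)
        for name in sorted(allowed & set(dataset))
    }
    logger.info(
        "dataset converted",
        extra={"schema_mode": ctx.mode.name, "dataset_fields": list(converted)},
    )
    return converted
```

With this layout, an inference caller passes `ConversionContext(SchemaMode.INFERENCE)` and supplies only `positions` and `seds`, while the same function rejects an evaluation dataset that lacks `sources`.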

New Inference Configuration Parameter

schema_mode: INFERENCE

Supported modes:

  • INFERENCE

    • Requires only positions and seds
  • EVALUATION

    • Expects additional fields such as sources and, optionally, masks
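One way the `schema_mode` value might be resolved once the inference configuration has been loaded into a dictionary is sketched below. The function name `resolve_schema_mode`, the case-insensitive matching, the fallback to INFERENCE when the key is absent, and the rejection of modes other than the two listed above are all assumptions made for illustration, not documented wf-psf behaviour.

```python
from enum import Enum


class SchemaMode(Enum):
    TRAIN = "TRAIN"
    EVALUATION = "EVALUATION"
    INFERENCE = "INFERENCE"


# Modes listed as supported for the inference configuration in this issue.
INFERENCE_CONFIG_MODES = (SchemaMode.INFERENCE, SchemaMode.EVALUATION)


def resolve_schema_mode(config, allowed=INFERENCE_CONFIG_MODES,
                        default=SchemaMode.INFERENCE):
    """Map the `schema_mode` entry of a config dict to a SchemaMode member.

    Assumptions: a missing key falls back to `default`, and matching is
    case-insensitive. Neither behaviour is specified in the issue.
    """
    raw = config.get("schema_mode")
    if raw is None:
        return default

    valid = ", ".join(mode.name for mode in allowed)
    try:
        mode = SchemaMode[str(raw).strip().upper()]
    except KeyError:
        raise ValueError(
            f"Unknown schema_mode {raw!r}; expected one of: {valid}"
        ) from None

    if mode not in allowed:
        raise ValueError(
            f"schema_mode {mode.name} is not supported here; "
            f"expected one of: {valid}"
        )
    return mode
```

Failing early with the list of valid names keeps a typo in the configuration from silently selecting the wrong dataset contract.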

Benefits

  • Enables integration with external pipelines (e.g. SHE)
  • Decouples dataset representation from workflow assumptions
  • Improves flexibility for future instruments and runtime configurations
  • Simplifies reuse of inference components for evaluation workflows
  • Makes dataset validation explicit and mode-aware

Validation

The refactor was validated by re-running:

  • training
  • evaluation
  • inference
  • mock SHE pipeline integration

with multiple dataset schemas. Results remained reproducible across workflows.

Labels

enhancement