Motivation
During integration testing of the SHE inference branch, it became clear that wf-psf currently assumes a fixed dataset structure across training, evaluation, and inference workflows.
This assumption is too restrictive for downstream integration scenarios where different processing stages may provide different subsets of fields at runtime.
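For example:
- Inference workflows may only provide positions and seds
- Training / evaluation workflows may additionally provide sources and masks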
In upcoming WaveDiff releases, the inference module will also be reused for evaluation workflows, which requires the pipeline to support multiple dataset “contracts” depending on runtime context.
Proposed Changes
Introduce a schema-driven dataset conversion and validation system allowing dataset requirements to be selected dynamically at runtime.
Key additions include:
- Dataset schema registry for runtime validation
- Support for multiple processing modes:
  - TRAIN
  - EVALUATION
  - INFERENCE
- Separation of:
  - dataset schema definitions
  - field conversion handlers
  - conversion context objects
- Runtime selection of schema mode through configuration
- Relaxation of hard-coded assumptions about required dataset fields
- Structured logging for conversion and validation operations
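To make the separation above concrete, here is a minimal sketch of how schema definitions, field conversion handlers, and a conversion context could be kept apart. It is not the wf-psf implementation: every name in it (Mode, DatasetSchema, ConversionContext, HANDLERS, SCHEMAS, convert_dataset, the logger name) is hypothetical, the TRAIN schema is assumed to match EVALUATION, and plain Python lists stand in for array types.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, FrozenSet

logger = logging.getLogger("dataset_conversion")  # hypothetical logger name


class Mode(Enum):
    TRAIN = "train"
    EVALUATION = "evaluation"
    INFERENCE = "inference"


@dataclass(frozen=True)
class DatasetSchema:
    """Schema definition: field names a mode requires or merely accepts."""
    required: FrozenSet[str]
    optional: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class ConversionContext:
    """Context passed to every handler (its fields are assumptions)."""
    mode: Mode
    dtype: type = float


# Field conversion handlers live apart from the schema definitions.
def to_float_rows(value: Any, ctx: ConversionContext) -> list:
    return [[ctx.dtype(x) for x in row] for row in value]


def to_bool_rows(value: Any, ctx: ConversionContext) -> list:
    return [[bool(x) for x in row] for row in value]


HANDLERS: Dict[str, Callable[[Any, ConversionContext], list]] = {
    "positions": to_float_rows,
    "seds": to_float_rows,
    "sources": to_float_rows,
    "masks": to_bool_rows,
}

# Schema registry keyed by mode. TRAIN mirroring EVALUATION is an assumption.
_FULL = DatasetSchema(
    required=frozenset({"positions", "seds", "sources"}),
    optional=frozenset({"masks"}),
)
SCHEMAS: Dict[Mode, DatasetSchema] = {
    Mode.TRAIN: _FULL,
    Mode.EVALUATION: _FULL,
    Mode.INFERENCE: DatasetSchema(required=frozenset({"positions", "seds"})),
}


def convert_dataset(raw: Dict[str, Any], ctx: ConversionContext) -> Dict[str, list]:
    """Validate raw against the schema for ctx.mode, then convert its fields."""
    schema = SCHEMAS[ctx.mode]
    missing = sorted(schema.required - set(raw))
    if missing:
        logger.error("mode=%s missing required fields: %s", ctx.mode.name, missing)
        raise ValueError(f"{ctx.mode.name} dataset is missing fields: {missing}")
    converted = {}
    # Fields outside the schema for this mode are ignored.
    for name in sorted(schema.required | schema.optional):
        if name in raw:
            converted[name] = HANDLERS[name](raw[name], ctx)
            logger.info("mode=%s converted field %s", ctx.mode.name, name)
    return converted
```

With this layout, supporting a new field means registering one handler and listing the field in the relevant schemas, without touching the validation loop.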
New Inference Configuration Parameter
Supported modes:
- INFERENCE
  - Requires only positions and seds
- EVALUATION
  - Expects additional fields such as sources and optional masks
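The two modes listed above could be selected from configuration and enforced roughly as follows. This is a hedged sketch only: the configuration key "dataset_schema_mode" is a placeholder rather than the actual parameter name, the default of INFERENCE is an assumption, and SchemaMode, mode_from_config, and validate_fields are hypothetical names.

```python
from enum import Enum
from typing import Any, Dict, Iterable, List


class SchemaMode(Enum):
    INFERENCE = "inference"
    EVALUATION = "evaluation"


# Required and optional fields per mode, following the list above.
REQUIRED = {
    SchemaMode.INFERENCE: {"positions", "seds"},
    SchemaMode.EVALUATION: {"positions", "seds", "sources"},
}
OPTIONAL = {
    SchemaMode.INFERENCE: set(),
    SchemaMode.EVALUATION: {"masks"},
}


def mode_from_config(config: Dict[str, Any]) -> SchemaMode:
    """Read the schema mode from a config mapping (placeholder key name)."""
    name = str(config.get("dataset_schema_mode", "inference")).upper()
    try:
        return SchemaMode[name]
    except KeyError:
        valid = [m.name for m in SchemaMode]
        raise ValueError(f"Unknown schema mode {name!r}; expected one of {valid}") from None


def validate_fields(fields: Iterable[str], mode: SchemaMode) -> List[str]:
    """Raise if a required field is absent; return the optional fields present."""
    present = set(fields)
    missing = sorted(REQUIRED[mode] - present)
    if missing:
        raise ValueError(f"{mode.name} mode requires missing fields: {missing}")
    return sorted(OPTIONAL[mode] & present)
```

For instance, a dataset holding only positions and seds would pass under INFERENCE but be rejected under EVALUATION because sources is absent.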
Benefits
- Enables integration with external pipelines (e.g. SHE)
- Decouples dataset representation from workflow assumptions
- Improves flexibility for future instruments and runtime configurations
- Simplifies reuse of inference components for evaluation workflows
- Makes dataset validation explicit and mode-aware
Validation
The refactor was validated by re-running:
- training
- evaluation
- inference
- mock SHE pipeline integration
with multiple dataset schemas. Results remained reproducible across workflows.