
217 refactor data conversion pipeline #219

Merged
ChaitanyaChawak merged 26 commits into develop from 217-refactor-data-conversion-pipeline on Jun 4, 2026

@jeipollack jeipollack commented May 21, 2026

Contributor

Summary

Introduce schema-driven dataset conversion and domain-specific conversion contexts to support flexible runtime dataset contracts across training, evaluation, and inference workflows.

This PR refactors the dataset conversion pipeline to decouple dataset validation from field-specific processing logic, enabling workflows with different dataset requirements (e.g. SHE inference vs evaluation/training datasets).

Closes #217
Closes #218

What’s changed

  • Add schema-driven dataset conversion infrastructure

  • Introduce runtime-selectable dataset modes:

    • TRAIN
    • EVALUATION
    • INFERENCE
  • Add dataset schema registry and validation handling

  • Introduce specialized conversion handlers module

  • Add ConversionContext and SEDContext abstractions

  • Refactor SED processing using domain-specific handler contexts

  • Remove hardcoded required field assumptions from conversion workflows

  • Refactor dataset constants to support schema-driven conversion and handler dispatch

  • Add configurable schema_mode to inference configuration

  • Improve logging throughout dataset conversion and inference flows

  • Simplify configuration namespace handling

  • Refactor TensorFlow conversion tests into schema-based contract tests

  • Improve typing for lazily initialized adapters and internal pipeline state
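
The schema-driven approach listed above can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: `DatasetMode` and its three members come from the description, while `DatasetSchema`, `SCHEMA_REGISTRY`, `HANDLERS`, `validate_dataset`, `convert_dataset`, and the `stars` field are hypothetical names chosen for illustration. Only `positions` and `seds` are field names taken from the PR text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, FrozenSet, Mapping


class DatasetMode(Enum):
    """Runtime-selectable dataset modes named in the PR description."""
    TRAIN = "train"
    EVALUATION = "evaluation"
    INFERENCE = "inference"


@dataclass(frozen=True)
class DatasetSchema:
    """Hypothetical schema: which fields a mode requires or tolerates."""
    required: FrozenSet[str]
    optional: FrozenSet[str] = frozenset()


# Hypothetical registry. "stars" is an assumed extra field for illustration;
# the real per-mode field lists are not given in the PR description.
SCHEMA_REGISTRY: Dict[DatasetMode, DatasetSchema] = {
    DatasetMode.TRAIN: DatasetSchema(frozenset({"positions", "seds", "stars"})),
    DatasetMode.EVALUATION: DatasetSchema(frozenset({"positions", "seds", "stars"})),
    DatasetMode.INFERENCE: DatasetSchema(frozenset({"positions", "seds"})),
}

# Hypothetical per-field handlers, dispatched by field name.
HANDLERS: Dict[str, Callable[[Any], Any]] = {
    "positions": lambda v: [tuple(float(c) for c in p) for p in v],
    "seds": lambda v: [[float(x) for x in sed] for sed in v],
}


def validate_dataset(dataset: Mapping[str, Any], mode: DatasetMode) -> None:
    """Raise ValueError if the dataset lacks a field the mode requires."""
    missing = SCHEMA_REGISTRY[mode].required - set(dataset)
    if missing:
        raise ValueError(f"{mode.name} dataset is missing fields: {sorted(missing)}")


def convert_dataset(dataset: Mapping[str, Any], mode: DatasetMode) -> Dict[str, Any]:
    """Validate against the mode's schema, then convert each known field."""
    validate_dataset(dataset, mode)
    schema = SCHEMA_REGISTRY[mode]
    # Fields outside the schema are ignored rather than converted.
    fields = (schema.required | schema.optional) & set(dataset)
    return {
        name: HANDLERS.get(name, lambda v: v)(dataset[name])
        for name in sorted(fields)
    }
```

In this sketch validation is separate from field handling, so a reduced inference dataset holding only `positions` and `seds` passes in `INFERENCE` mode, while the same dataset is rejected in `TRAIN` mode.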


How to test / verify

  • Run training workflow using standard training configuration

  • Run evaluation workflow using DatasetMode.EVALUATION

  • Run inference workflow using DatasetMode.INFERENCE

  • Run inference workflow using reduced datasets containing only:

    • positions
    • seds
  • Execute unit and integration tests:

        pytest src/wf_psf/tests/

  • Verify reproducibility of:

    • training
    • evaluation
    • inference outputs
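
One way to make the reproducibility check concrete is to compare the outputs of two runs field by field within a tolerance. The sketch below assumes each run's outputs have already been flattened into plain lists of floats keyed by name; `outputs_match` and the `psf` key are hypothetical and not part of the project.

```python
import math
from typing import Mapping, Sequence


def outputs_match(
    run_a: Mapping[str, Sequence[float]],
    run_b: Mapping[str, Sequence[float]],
    rel_tol: float = 1e-9,
    abs_tol: float = 0.0,
) -> bool:
    """Return True if both runs have the same fields with close values."""
    if set(run_a) != set(run_b):
        return False
    for name in run_a:
        a, b = run_a[name], run_b[name]
        if len(a) != len(b):
            return False
        # Compare element-wise so small floating-point noise is tolerated.
        if not all(
            math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            for x, y in zip(a, b)
        ):
            return False
    return True
```

For example, `outputs_match({"psf": [1.0, 2.0]}, {"psf": [1.0, 2.0]})` holds, whereas two runs whose values differ beyond the tolerance do not match.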

Scope

  • Feature
  • Bug fix
  • Hotfix
  • Documentation / process change
  • Internal / refactor
  • Release

This PR is part of the broader effort to improve runtime flexibility and external pipeline interoperability for WaveDiff and downstream integrations (e.g. SHE workflows).


Changelog

  • Changelog fragment added (if applicable)

Reviewer Checklist

  • The PR targets the correct base branch (develop, or main for release PRs)
  • The PR is assigned to the developer
  • Appropriate labels are applied
  • The PR is included in relevant projects and/or milestones
  • Description clearly explains what has changed
  • Issue references included, if applicable
  • Code and documentation adhere to current standards (ruff)
  • Documentation updates included, if relevant
  • CI tests are passing
  • All reviewer comments have been addressed

Next Steps / Notes (if applicable)

  • Evaluate future extension of conversion contexts for additional instrument domains
  • Continue reducing hardcoded assumptions in dataset preparation workflows
  • Assess whether schema validation should eventually move earlier into dataset loading stages

@jeipollack self-assigned this on May 21, 2026
@jeipollack added the enhancement label on May 21, 2026
@jeipollack force-pushed the 217-refactor-data-conversion-pipeline branch from 1189d6a to d3bb9c1 on May 21, 2026 07:43
@jeipollack

Contributor Author

I need to update the CI pipeline because compatibility with Python 3.9 is broken.

@jeipollack

Contributor Author

@ChaitanyaChawak I fixed the CI error. I dropped Python 3.9 and Python 3.12 from the matrix. 3.9 is too old for TensorFlow 2.15 and 3.12 is too new.

What remains is the changelog. I will work on it now.

@jeipollack

Contributor Author

Hello @ChaitanyaChawak, I pushed the changelog. All is done on my side. :)

@jeipollack

Contributor Author

Actually, I lied. I am still checking the API documentation.

Jennifer Pollack added 4 commits May 27, 2026 18:08
- Update parameter description for clarity
- Update installation guide on environment and dependencies
- Update configuration guide section with new schema_mode parameter
@jeipollack added the documentation label on May 27, 2026
@ChaitanyaChawak merged commit 689674f into develop on Jun 4, 2026
2 checks passed
@ChaitanyaChawak deleted the 217-refactor-data-conversion-pipeline branch on Jun 4, 2026 07:11
Labels

documentation, enhancement

2 participants