217 refactor data conversion pipeline #219
Merged
added 11 commits
May 21, 2026 09:42
Introduce dedicated handlers module with registry for specialized dataset transformation logic used during schema-driven conversion.
quality
- Add configurable schema_mode for inference workflows
- Simplify nested inference and training configuration namespaces
- Add structured logging for inference pipeline operations
- Replace untyped adapter initialization with explicit optional type hints
1189d6a to
d3bb9c1
Compare
Contributor
Author
I need to update the CI pipeline because compatibility with Python 3.9 is broken.
added 8 commits
May 21, 2026 09:55
…tor objects and method calls in unit test
Contributor
Author
@ChaitanyaChawak I fixed the CI error. I dropped Python 3.9 and Python 3.12 from the matrix: 3.9 is too old for TensorFlow 2.15 and 3.12 is too new. What remains is the changelog. I will work on it now.
Contributor
Author
Hello @ChaitanyaChawak, I pushed the changelog. All is done on my side. :)
Contributor
Author
Actually, I lied. I am checking the API documentation.
added 4 commits
May 27, 2026 18:08
- Update parameter description for clarity
- Update installation guide on environment and dependencies
- Update configuration guide section with new schema_mode parameter
17 tasks
ChaitanyaChawak
approved these changes
Jun 4, 2026
Summary
Introduce schema-driven dataset conversion and domain-specific conversion contexts to support flexible runtime dataset contracts across training, evaluation, and inference workflows.
This PR refactors the dataset conversion pipeline to decouple dataset validation from field-specific processing logic, enabling workflows with different dataset requirements (e.g. SHE inference vs evaluation/training datasets).
Closes #217
Closes #218
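To illustrate the decoupling described in the summary, here is a minimal sketch rather than the code in this PR. It assumes field-specific processing is looked up in a registry keyed by field name. The names `HANDLERS`, `register_handler`, `convert_positions` and `convert`, and the `options` field on `ConversionContext`, are all hypothetical. Only the class name `ConversionContext` is borrowed from the description, and the real class may look different.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Mapping

# Hypothetical stand-in for the ConversionContext named in this PR;
# the real class may carry different fields.
@dataclass
class ConversionContext:
    options: Dict[str, Any] = field(default_factory=dict)

Handler = Callable[[Any, ConversionContext], Any]

# Hypothetical registry mapping a dataset field name to its handler.
HANDLERS: Dict[str, Handler] = {}

def register_handler(field_name: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one dataset field."""
    def decorator(func: Handler) -> Handler:
        HANDLERS[field_name] = func
        return func
    return decorator

@register_handler("positions")
def convert_positions(value: Any, ctx: ConversionContext) -> Any:
    # Illustrative transformation only: coerce each position to a float tuple.
    return [tuple(float(c) for c in pos) for pos in value]

def passthrough(value: Any, ctx: ConversionContext) -> Any:
    # Fields without a specialized handler are returned unchanged.
    return value

def convert(dataset: Mapping[str, Any], ctx: ConversionContext) -> Dict[str, Any]:
    """Dispatch every field to its registered handler, if any."""
    return {
        name: HANDLERS.get(name, passthrough)(value, ctx)
        for name, value in dataset.items()
    }
```

With this shape, adding support for a new field means registering one more handler, and the conversion loop itself does not need to know which fields a given workflow requires.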
What’s changed
- Add schema-driven dataset conversion infrastructure
- Introduce runtime-selectable dataset modes:
  - TRAIN
  - EVALUATION
  - INFERENCE
- Add dataset schema registry and validation handling
- Introduce specialized conversion handlers module
- Add ConversionContext and SEDContext abstractions
- Refactor SED processing using domain-specific handler contexts
- Remove hardcoded required field assumptions from conversion workflows
- Refactor dataset constants to support schema-driven conversion and handler dispatch
- Add configurable schema_mode to inference configuration
- Improve logging throughout dataset conversion and inference flows
- Simplify configuration namespace handling
- Refactor TensorFlow conversion tests into schema-based contract tests
- Improve typing for lazily initialized adapters and internal pipeline state
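The dataset modes and schema registry listed above could look roughly like the sketch below. It is an assumption-laden illustration, not the implementation in this PR. The member names `TRAIN`, `EVALUATION` and `INFERENCE` come from the description, and `positions` and `seds` come from the reduced inference datasets mentioned in the testing notes. The string values, the `SCHEMAS` mapping, the `stars` field and `validate_dataset` are hypothetical.

```python
from enum import Enum
from typing import Any, Dict, FrozenSet, Mapping

class DatasetMode(Enum):
    # Member names follow the PR description; the string values are assumed.
    TRAIN = "train"
    EVALUATION = "evaluation"
    INFERENCE = "inference"

# Hypothetical schema registry: required fields per mode.
# "stars" is a placeholder for whatever extra fields training
# and evaluation actually require.
SCHEMAS: Dict[DatasetMode, FrozenSet[str]] = {
    DatasetMode.TRAIN: frozenset({"positions", "seds", "stars"}),
    DatasetMode.EVALUATION: frozenset({"positions", "seds", "stars"}),
    DatasetMode.INFERENCE: frozenset({"positions", "seds"}),
}

def validate_dataset(dataset: Mapping[str, Any], mode: DatasetMode) -> None:
    """Raise ValueError if the dataset lacks a field required by the mode."""
    missing = SCHEMAS[mode].difference(dataset.keys())
    if missing:
        raise ValueError(
            f"{mode.name} dataset is missing fields: {sorted(missing)}"
        )
```

Under these assumptions, a reduced dataset with only `positions` and `seds` passes validation in `DatasetMode.INFERENCE` and is rejected in `DatasetMode.TRAIN`, which is the behaviour the removal of hardcoded required fields is meant to allow.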
How to test / verify
- Run training workflow using standard training configuration
- Run evaluation workflow using DatasetMode.EVALUATION
- Run inference workflow using DatasetMode.INFERENCE
- Run inference workflow using reduced datasets containing only:
  - positions
  - seds
- Execute unit and integration tests:
- Verify reproducibility of:
Scope
This PR is part of the broader effort to improve runtime flexibility and external pipeline interoperability for WaveDiff and downstream integrations (e.g. SHE workflows).
Changelog
Reviewer Checklist
develop, or main for release PRs)
ruff)
Next Steps / Notes (if applicable)