did2: retarget conversion pipeline to V_epsilon + active deprecated-family migrators #145
Open
stevevanhooser wants to merge 6 commits into
Make the V2 schema cache and converter resolve a tiered, index.json-based
schema set instead of a single flat V_delta/stable directory, and carry the
set-version string as a single source of truth so the did_v1 -> V2 pipeline
targets V_epsilon by default.
Schema cache (did2.schema.cache):
- Add index mode: when schemaPath holds an index.json, resolve classes by
  class_name through the index to their tier folder (stable/draft/deprecated),
  and read the set-version string from the index's schema_version_value.
  Flat mode (a bare tier dir, no index) is retained for back-compat and
  defaults the version to V_delta.
- Point defaultSchemaPath at the V_epsilon set-version root.
- buildBlankDocument and the registry loader use the resolved version and
  index instead of hardcoded 'V_delta' / flat-dir assumptions.
Converter (did2.convert):
- universalRenames stamps document_class.schema_version from the active
  cache version (override via SchemaVersion); the renames themselves are
  set-version-independent.
- v1_to_v2 resolves the target version once and supports migrator fan-out:
  a migrator may return a cell array of bodies so one v1 document can mint
  several V2 documents (treatment split, subject_group -> subject +
  group_assignment, interactions that mint time_reference).
  ensureClassBlocks now also fills unset class_version/schema_version so
  fan-out migrators can emit minimal bodies. Fan-out is all-or-nothing.
https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX
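The index-mode lookup described in this commit can be sketched as below. This is only an illustration under stated assumptions: the `classes` array, its `tier` field, and the `<class_name>.json` file naming are hypothetical, since the real index.json layout is defined in the DID-schema repository; only `class_name`, the tier folder names, and `schema_version_value` come from the commit message.

```matlab
% Hypothetical index.json content; the real layout lives in DID-schema.
indexText = ['{"schema_version_value":"V_epsilon",' ...
             '"classes":[' ...
             '{"class_name":"injection","tier":"draft"},' ...
             '{"class_name":"subject","tier":"stable"},' ...
             '{"class_name":"treatment","tier":"deprecated"}]}'];
index = jsondecode(indexText);   % classes decodes to a struct array

% Set-version root (the directory that would hold index.json).
schemaRoot = fullfile('did-schema', 'schemas', 'V_epsilon');

% Resolve a class by class_name to its tier folder through the index.
className = 'injection';
hit = strcmp({index.classes.class_name}, className);
assert(any(hit), 'Class "%s" is not listed in the index.', className);
tier = index.classes(find(hit, 1)).tier;

% File naming is an assumption made for this sketch.
schemaFile = fullfile(schemaRoot, tier, [className '.json']);

% The set-version string is read from the index rather than hardcoded.
setVersion = index.schema_version_value;
```

In flat mode there is no index to read, so a cache following the commit's description would skip the lookup and default `setVersion` to 'V_delta'.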
Implement the 'fully active' conversion of the five legacy families that
V_epsilon deprecates/re-roots, plus a shared helper. Each migrator builds
target documents in the new subject_interaction families and fans out
companion documents (time_reference, group_assignment, scalar
observations) through the dispatcher's new multi-body support.
New shared helper (did2.convert.interactionCommon):
- identity carryover (base id/session_id/datestamp), depends_on
building, ontology_term/concentration/volume composites, mixture_table
parsing, and minting utc_reference companions.
New / reworked migrators:
- treatment -> split into temperature_/procedural_/environmental_manipulation
  on name+ontology (incl. the Dab 'Target Location' target_structure case);
  numeric values preserved as companion generic_scalar_observations;
  non-manipulation/unresolvable records quarantined (report-only-first).
- treatment_drug -> injection (kind=drug); mixture_table -> mixture;
  location -> target_structure; onset/offset -> companion utc_reference.
- virus_injection -> injection (kind=virus); construct(+diluent) -> mixture;
  location -> target_structure; date -> companion approximate utc_reference.
- treatment_transfer -> biological_transfer; method_* -> procedure;
  entity_* -> entity; recipient_id -> subject_id; donor_id carried;
  global-clock ts -> utc_reference.
- subject_group -> subject(is_group=true) carrying the legacy id;
  member -> companion group_assignment.
- stimulus_bath -> re-rooted under bath (mixture/location/kind),
  epochid superclass dropped.
Timing/ontology gaps follow the draft-tier soak strategy: timestamps are
synthesized only when recoverable, ontology nodes left blank for curator
backfill; nothing legacy is dropped silently.
https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX
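The multi-body fan-out with all-or-nothing semantics can be illustrated with a small sketch. It is not the dispatcher's actual code: the body contents, the error identifier, and the variable names are hypothetical; only the ideas (a migrator may return a cell array of bodies, unset schema_version is filled in, and any failure quarantines the whole source document) come from the commit messages.

```matlab
targetVersion = 'V_epsilon';

% Hypothetical migrator output: one legacy subject_group body fans out
% into a subject body plus a companion group_assignment body.
bodies = { ...
    struct('document_class', struct('class_name', 'subject')), ...
    struct('document_class', struct('class_name', 'group_assignment'))};

% A migrator that returns a single struct is treated as a one-element fan-out.
if ~iscell(bodies)
    bodies = {bodies};
end

% Stage every body first; only commit the set if all of them succeed.
staged = cell(size(bodies));
try
    for k = 1:numel(bodies)
        body = bodies{k};
        if ~isfield(body, 'document_class') || ...
                ~isfield(body.document_class, 'class_name')
            error('sketch:fanout:badBody', 'Body %d has no class_name.', k);
        end
        % Fill the unset version so migrators can emit minimal bodies.
        if ~isfield(body.document_class, 'schema_version')
            body.document_class.schema_version = targetVersion;
        end
        staged{k} = body;
    end
    minted = staged;        % all bodies valid: mint the full set
    quarantined = false;
catch
    minted = {};            % any failure: mint nothing, quarantine the source
    quarantined = true;
end
```

Staging before committing is what keeps a partially converted document from leaving orphaned companions behind.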
    if numel(matches)>0
        notfound = 0;
        d{i} = getfield(did_document_obj.document_properties.depends_on(matches(1)),'value');
        d{i} = did.document.i_readDependencyTarget( ...

    % Box the handle so the closure sees mutations, then null it out once
    % we've explicitly closed (preventing a double-close in the cleanup hook).
    handle = struct('id', dbid);
    cleanup = onCleanup(@() closeIfOpen(handle)); %#ok<NASGU>

        'Verbose', options.Verbose);

    db = did2.database.sqlitedb(dstPath, 'SchemaCache', options.SchemaCache);
    cleanup = onCleanup(@() db.close()); %#ok<NASGU>

        error('did2:convert:readerFailed', ...
            'Failed to open quarantine file "%s" for writing.', quarantineFile);
    end
    cleanup = onCleanup(@() fclose(fid)); %#ok<NASGU>

    function [names, counts] = bumpClassCounter(names, counts, name)
    idx = find(strcmp(names, name), 1);
    if isempty(idx)
        names{end+1} = name; %#ok<AGROW>
        counts(end+1) = 1; %#ok<AGROW>
    end

    desired = sort({obj.queryableScalarColumns.column});
    current = obj.currentQueryableColumns();
    if isequal(sort(current(:)'), desired(:)')
Fix the 'Run symmetry tests' failure (MATLAB:heterogeneousStrucAssignment
in did.util.compareDatabaseSummary>normalizeDeps): the comparator only
handled the legacy {name,value} depends_on shape and crashed on the
V_delta/V_epsilon {name,document_id} shape via a whole-struct assignment
between dissimilar structures. Rebuild a homogeneous {name,value} array
field-by-field, mapping document_id/value/id onto value so summaries
compare equal regardless of entry-key shape.
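A minimal sketch of the field-by-field rebuild follows. It is not the code in did.util.compareDatabaseSummary; the sample entries and the precedence among the value/document_id/id keys are assumptions made for illustration. For context, a whole-struct assignment such as `legacyDeps(2) = newDeps` is the kind of statement that raises MATLAB:heterogeneousStrucAssignment when the field names differ.

```matlab
% Two depends_on shapes that should summarize identically.
legacyDeps = struct('name', {'subject_id'}, 'value', {'abc123'});
newDeps    = struct('name', {'subject_id'}, 'document_id', {'abc123'});

inputs = {legacyDeps, newDeps};
normalized = cell(size(inputs));
for s = 1:numel(inputs)
    deps = inputs{s};
    % Preallocate a homogeneous {name,value} array, then copy per field
    % instead of assigning whole structs of differing shape.
    out = repmat(struct('name', '', 'value', ''), 1, numel(deps));
    for k = 1:numel(deps)
        out(k).name = deps(k).name;
        % Key precedence here is an assumption of this sketch.
        for key = {'value', 'document_id', 'id'}
            if isfield(deps, key{1})
                out(k).value = deps(k).(key{1});
                break;
            end
        end
    end
    normalized{s} = out;
end

% Both shapes now compare equal.
sameSummary = isequal(normalized{1}, normalized{2});
```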
Also clear two MATLAB Code Analyzer findings in the new migrators:
- treatment.m: classifyTreatment's unused ontology-node arg -> ~.
- treatment_transfer.m: replace deprecated datestr with datetime and drop
the redundant pre-assignment in datenumToISO.
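For illustration, a datenum-to-ISO conversion built on datetime rather than datestr might look like the sketch below. The actual body of datenumToISO is not shown in this PR page, so the output format and the absence of time-zone handling are assumptions.

```matlab
% Legacy serial date number; built with datenum() only to create sample input.
dn = datenum(2021, 11, 16, 12, 0, 0);

% datetime replaces datestr: convert the serial date number directly.
t = datetime(dn, 'ConvertFrom', 'datenum');

% Render an ISO 8601 style string (no time zone is attached in this sketch).
iso = char(t, 'yyyy-MM-dd''T''HH:mm:ss');
```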
https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX
Point the did2 test harness at the V_epsilon schema set (the PR's target)
and update the migrator unit tests whose outputs changed under the active
conversion:
- test-code.yml: DID_SCHEMA_PATH -> did-schema/schemas/V_epsilon (the
  set-version root, index mode), so did2 resolves the new draft classes
  (injection, bath, procedural_manipulation, ...).
- testCorpus20211116 / testCorpusPRED: resolveSchemaPath fallback ->
  V_epsilon root.
- testMigrators: rewrite the two legacy 'treatment passthrough' cases to
  the V_epsilon split (procedural_manipulation + unresolved->quarantine);
  schema_version stamp expectation -> V_epsilon.
Follow-up (next commits, converged against CI output): testConvertV1ToV2
schema_version assertions + stimulus_bath expected shape, and the
testCorpus* count/quarantine expectations.
…e 2)
- testConvertV1ToV2: flip the schema_version *default-stamp* assertions to
  V_epsilon (the active set version the pipeline now stamps) and stamp the
  already-target skeleton (makeVDeltaSkeleton -> makeTargetSkeleton)
  V_epsilon so the idempotency short-circuit fires. Leaves the intentional
  cases unchanged: the 'did_v1' preserve test and the stale-base-
  schema_version migration (value copied from base).
- testFromV1Database: switch the synthetic sample bodies off 'treatment'
  (now actively split/quarantined) to 'daqreader_ndr' (identity
  passthrough), so the orchestration tests (roundtrip/overwrite/quarantine
  sidecar) exercise the dispatcher mechanics, not treatment routing.
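The idempotency short-circuit mentioned above can be pictured with a short sketch: a document already stamped with the target set version is skipped, anything else goes through conversion. The sample documents and variable names are hypothetical, and the real check in v1_to_v2 may consider more than this one field.

```matlab
targetVersion = 'V_epsilon';

% Hypothetical inputs: one already-target skeleton and one legacy document.
docs = { ...
    struct('document_class', struct('class_name', 'subject', ...
                                    'schema_version', 'V_epsilon')), ...
    struct('document_class', struct('class_name', 'subject', ...
                                    'schema_version', 'did_v1'))};

needsConversion = false(size(docs));
for k = 1:numel(docs)
    dc = docs{k}.document_class;
    % Short-circuit: a body already stamped with the target version is left alone.
    alreadyTarget = isfield(dc, 'schema_version') && ...
                    strcmp(dc.schema_version, targetVersion);
    needsConversion(k) = ~alreadyTarget;
end
```

Because the comparison uses the resolved target version rather than a literal, a skeleton stamped 'V_delta' would no longer short-circuit once the pipeline targets V_epsilon, which is why the test skeleton had to be restamped.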
…o V_epsilon (phase 3)
- interactionCommon.concentration now populates a canonical sub-field
  (molar / grams_per_liter / mass_fraction / volume_fraction) when the
  source unit is recognised, restoring the fidelity the pre-rework
  stimulus_bath migrator had and extending it to injection/bath/
  treatment_drug mixtures. Unknown units stay source-only.
- stimulus_bath migrator: restore the missing-block error (malformed v1
  with no stimulus_bath block quarantines, matching the prior contract).
- testMigrators stimulus_bath cases: re-root assertions onto the V_epsilon
  blocks (bath.location, pharmacological_manipulation.mixture), keep the
  molar-scaling/unknown-unit expectations, and replace the empty-table
  'zero-length' case with the required-non-empty backfill placeholder
  behavior.
https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX
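As a rough picture of the canonical sub-field behavior, the sketch below maps a recognised source unit onto one of the four canonical fields and leaves unknown units source-only. The unit spellings, the source_value/source_unit field names, and the scaling table are assumptions; only the canonical field names come from the commit message.

```matlab
sourceValue = 5;
sourceUnit  = 'mM';

% Always keep what the legacy record said (field names are hypothetical).
conc = struct('source_value', sourceValue, 'source_unit', sourceUnit);

% Populate a canonical sub-field only when the unit is recognised.
switch sourceUnit
    case 'M'
        conc.molar = sourceValue;
    case 'mM'
        conc.molar = sourceValue * 1e-3;
    case 'uM'
        conc.molar = sourceValue * 1e-6;
    case 'g/L'
        conc.grams_per_liter = sourceValue;
    case '%w/w'
        conc.mass_fraction = sourceValue / 100;
    case '%v/v'
        conc.volume_fraction = sourceValue / 100;
    otherwise
        % Unknown unit: stay source-only, no canonical field is guessed.
end
```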
Summary
Retargets the did2 did_v1 conversion pipeline from a single flat V_delta/stable directory to the tiered, index.json-based V_epsilon schema set, and implements the fully active conversion of the five families V_epsilon deprecates/re-roots.
This is the DID-matlab half of a 3-repo change (paired with DID-schema and NDI-matlab PRs on the same branch).
Infrastructure (commit 1)
- did2.schema.cache index mode. Pointed at a set-version root containing index.json, the cache resolves classes by class_name to their tier folder (stable/draft/deprecated) and reads the set-version string from schema_version_value. Flat mode (a bare tier dir, no index) is retained for back-compat and defaults to V_delta. defaultSchemaPath now points at the V_epsilon root.
- universalRenames and v1_to_v2 stamp document_class.schema_version from the active cache version (override via SchemaVersion); the idempotency short-circuit keys off it. Retargeting a future set is a pin change, not a code change.
- ensureClassBlocks fills unset class_version/schema_version so fan-out migrators can emit minimal bodies. Fan-out is all-or-nothing (any failure quarantines the source).
Active migrators (commit 2)
New shared helper did2.convert.interactionCommon (identity carryover incl. session_id, depends_on building, ontology_term/concentration/volume composites, mixture-table parsing, minting utc_reference companions). Migrators:
- treatment -> temperature_/procedural_/environmental_manipulation (+ companion generic_scalar_observation); non-manipulation/unresolvable records quarantine (report-only-first)
- treatment_drug -> injection (kind=drug) (+ utc_reference)
- virus_injection -> injection (kind=virus) (+ utc_reference)
- treatment_transfer -> biological_transfer (+ utc_reference)
- subject_group -> subject (is_group=true, legacy id carried) (+ group_assignment)
- stimulus_bath -> bath
Timing is synthesized into time_reference companions only when recoverable; otherwise the dependency is omitted and flagged for curator backfill, never faked. Ontology nodes absent in the legacy shapes are left blank for backfill (draft-tier soak). Nothing legacy is dropped silently.
Not done / caveats
- treatment keyword routing (which the proposals want run report-only-first; the migrator quarantines anything it can't confidently route).
Routing/field-mapping specs are in the paired DID-schema PR under schemas/V_epsilon/conversions/from_did_v1/.
https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX
Generated by Claude Code