did2: retarget conversion pipeline to V_epsilon + active deprecated-family migrators#145

Open
stevevanhooser wants to merge 6 commits into V2 from claude/nifty-galileo-4ip93w
@stevevanhooser

Contributor

Summary

Retargets the did2 conversion pipeline for did_v1 documents from a single flat V_delta/stable directory to the tiered, index.json-based V_epsilon schema set, and implements the fully active conversion of the five families that V_epsilon deprecates or re-roots.

This is the DID-matlab half of a 3-repo change (paired with DID-schema and NDI-matlab PRs on the same branch).

Infrastructure (commit 1)

  • did2.schema.cache index mode. Pointed at a set-version root containing index.json, the cache resolves classes by class_name to their tier folder (stable/draft/deprecated) and reads the set-version string from schema_version_value. Flat mode (a bare tier dir, no index) is retained for back-compat and defaults to V_delta. defaultSchemaPath now points at the V_epsilon root.
  • Single source of truth for the version string. universalRenames and v1_to_v2 stamp document_class.schema_version from the active cache version (override via SchemaVersion); the idempotency short-circuit keys off it. Retargeting a future set is a pin change, not a code change.
  • Migrator fan-out. A migrator may return a cell array of bodies, so one did_v1 document can mint several V_epsilon documents. ensureClassBlocks fills unset class_version/schema_version so fan-out migrators can emit minimal bodies. Fan-out is all-or-nothing (any failure quarantines the source).

Active migrators (commit 2)

New shared helper did2.convert.interactionCommon: identity carryover (including session_id), depends_on building, ontology_term/concentration/volume composites, mixture-table parsing, and minting of utc_reference companions. Migrators:

| did_v1 | V_epsilon target(s) |
|---|---|
| treatment | split → temperature_/procedural_/environmental_manipulation (+ companion generic_scalar_observation); non-manipulation/unresolvable records quarantine (report-only-first) |
| treatment_drug | injection (kind=drug) (+ utc_reference) |
| virus_injection | injection (kind=virus) (+ utc_reference) |
| treatment_transfer | biological_transfer (+ utc_reference) |
| subject_group | subject (is_group=true, legacy id carried) (+ group_assignment) |
| stimulus_bath | re-rooted under bath |

Timing is synthesized into time_reference companions only when recoverable; otherwise the dependency is omitted and flagged for curator backfill, never faked. Ontology nodes absent in the legacy shapes are left blank for backfill (draft-tier soak). Nothing legacy is dropped silently.
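
A minimal sketch of the "only when recoverable" rule, assuming a legacy record carries its onset as text. The record layout, the `onset`/`source_id`/`timestamp` field names, and the input format are hypothetical and are not taken from the migrators in this PR.

```matlab
% Hypothetical legacy records: one parseable timestamp, one free-text value.
records = {struct('id', 'a1', 'onset', '2021-11-16 09:30:00'), ...
           struct('id', 'a2', 'onset', 'sometime in the morning')};

companions = {};  % minted time-reference bodies (illustrative shape only)
backfill   = {};  % source ids flagged for curator backfill

for k = 1:numel(records)
    r = records{k};
    try
        % datetime throws when the text does not match InputFormat.
        t = datetime(r.onset, 'InputFormat', 'yyyy-MM-dd HH:mm:ss');
        recoverable = ~isnat(t);
    catch
        recoverable = false;
    end
    if recoverable
        stamp = char(t, 'yyyy-MM-dd''T''HH:mm:ss');
        companions{end+1} = struct('source_id', r.id, 'timestamp', stamp); %#ok<SAGROW>
    else
        % Never fabricate a time: omit the dependency and flag the record.
        backfill{end+1} = r.id; %#ok<SAGROW>
    end
end
```

In this sketch the unparseable record produces no companion at all, so a downstream reader cannot mistake a placeholder for a measured time.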

Not done / caveats

  • Not executed. No MATLAB/Octave was available in the authoring environment, so the migrators are reviewed-by-reading, not run. They need a test pass against a real corpus — especially timestamp formatting, composite sub-field shapes, and the treatment keyword routing (which the proposals want run report-only-first; the migrator quarantines anything it can't confidently route).

Routing/field-mapping specs are in the paired DID-schema PR under schemas/V_epsilon/conversions/from_did_v1/.

https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX


Generated by Claude Code

claude added 2 commits June 11, 2026 23:30
Make the V2 schema cache and converter resolve a tiered, index.json-based
schema set instead of a single flat V_delta/stable directory, and carry
the set-version string as a single source of truth so the did_v1 -> V2
pipeline targets V_epsilon by default.

Schema cache (did2.schema.cache):
- Add index mode: when schemaPath holds an index.json, resolve classes
  by class_name through the index to their tier folder
  (stable/draft/deprecated), and read the set-version string from the
  index's schema_version_value. Flat mode (a bare tier dir, no index)
  is retained for back-compat and defaults the version to V_delta.
- Point defaultSchemaPath at the V_epsilon set-version root.
- buildBlankDocument and the registry loader use the resolved version
  and index instead of hardcoded 'V_delta' / flat-dir assumptions.
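
The index-mode lookup described above could be sketched roughly as follows. Only `schema_version_value`, `class_name`, and the three tier names come from this PR; the `classes` array, the `tier` key, and the `<class_name>.json` file naming are assumptions made for illustration, not the actual index.json layout.

```matlab
% Hypothetical index.json content (real layout may differ).
indexText = ['{"schema_version_value":"V_epsilon","classes":[' ...
    '{"class_name":"subject","tier":"stable"},' ...
    '{"class_name":"injection","tier":"draft"},' ...
    '{"class_name":"treatment","tier":"deprecated"}]}'];

idx = jsondecode(indexText);             % objects with equal fields -> struct array
setVersion = idx.schema_version_value;   % single source of the version string

root = fullfile('schemas', 'V_epsilon'); % set-version root (illustrative path)
className = 'injection';

% Resolve the class to its tier folder through the index.
hit = strcmp({idx.classes.class_name}, className);
if ~any(hit)
    error('sketch:unknownClass', 'Class "%s" is not listed in the index.', className);
end
schemaFile = fullfile(root, idx.classes(hit).tier, [className '.json']);
```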

Converter (did2.convert):
- universalRenames stamps document_class.schema_version from the active
  cache version (override via SchemaVersion); the renames themselves are
  set-version-independent.
- v1_to_v2 resolves the target version once and supports migrator
  fan-out: a migrator may return a cell array of bodies so one v1
  document can mint several V2 documents (treatment split, subject_group
  -> subject + group_assignment, interactions that mint time_reference).
  ensureClassBlocks now also fills unset class_version/schema_version so
  fan-out migrators can emit minimal bodies. Fan-out is all-or-nothing.
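
As an illustration of all-or-nothing fan-out, the following sketch stages every body returned by a migrator and commits them only if all pass. The validity check (non-empty class_name) and the variable names are hypothetical stand-ins for whatever v1_to_v2 and ensureClassBlocks actually do.

```matlab
% Two bodies from one hypothetical migrator call; the second is invalid.
bodies = {struct('document_class', struct('class_name', 'subject')), ...
          struct('document_class', struct('class_name', ''))};
targetVersion = 'V_epsilon';

staged = cell(size(bodies));
quarantined = false;
reason = '';
try
    for k = 1:numel(bodies)
        b = bodies{k};
        if isempty(b.document_class.class_name)
            error('sketch:invalidBody', 'Body %d has no class_name.', k);
        end
        % Fill an unset schema_version so migrators can emit minimal bodies.
        if ~isfield(b.document_class, 'schema_version') || ...
                isempty(b.document_class.schema_version)
            b.document_class.schema_version = targetVersion;
        end
        staged{k} = b;
    end
    committed = staged;      % every body passed: commit them together
catch err
    committed = {};          % any failure discards the whole fan-out
    quarantined = true;      % and quarantines the source document
    reason = err.message;
end
```

Staging before committing means a partially converted source never reaches the target database in this sketch.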

https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX
Implement the 'fully active' conversion of the five legacy families that
V_epsilon deprecates/re-roots, plus a shared helper. Each migrator builds
target documents in the new subject_interaction families and fans out
companion documents (time_reference, group_assignment, scalar
observations) through the dispatcher's new multi-body support.

New shared helper (did2.convert.interactionCommon):
- identity carryover (base id/session_id/datestamp), depends_on
  building, ontology_term/concentration/volume composites, mixture_table
  parsing, and minting utc_reference companions.

New / reworked migrators:
- treatment        -> split into temperature_/procedural_/environmental_
                      manipulation on name+ontology (incl. the Dab
                      'Target Location' target_structure case); numeric
                      values preserved as companion generic_scalar_
                      observations; non-manipulation/unresolvable records
                      quarantined (report-only-first).
- treatment_drug   -> injection (kind=drug); mixture_table -> mixture;
                      location -> target_structure; onset/offset ->
                      companion utc_reference.
- virus_injection  -> injection (kind=virus); construct(+diluent) ->
                      mixture; location -> target_structure; date ->
                      companion approximate utc_reference.
- treatment_transfer -> biological_transfer; method_* -> procedure;
                      entity_* -> entity; recipient_id -> subject_id;
                      donor_id carried; global-clock ts -> utc_reference.
- subject_group    -> subject(is_group=true) carrying the legacy id;
                      member -> companion group_assignment.
- stimulus_bath    -> re-rooted under bath (mixture/location/kind),
                      epochid superclass dropped.
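
To make the subject_group row concrete, here is a rough sketch of one legacy document fanning out into a subject body plus one group_assignment per member. The flat body layout and the `members`, `group_id`, and `member_id` field names are assumptions for illustration; only is_group=true, the carried legacy id, and the two target class names come from the description above.

```matlab
% Hypothetical legacy subject_group record.
legacy = struct('id', 'grp-001', 'session_id', 'sess-9', ...
                'members', {{'subj-a', 'subj-b'}});

% Primary body: a subject flagged as a group, carrying the legacy id.
subjectBody = struct('class_name', 'subject', 'id', legacy.id, ...
                     'session_id', legacy.session_id, 'is_group', true);

% Companion bodies: one group_assignment per legacy member.
out = {subjectBody};
for k = 1:numel(legacy.members)
    out{end+1} = struct('class_name', 'group_assignment', ...
                        'group_id', legacy.id, ...
                        'member_id', legacy.members{k}); %#ok<SAGROW>
end
```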

Timing/ontology gaps follow the draft-tier soak strategy: timestamps are
synthesized only when recoverable, ontology nodes left blank for curator
backfill; nothing legacy is dropped silently.

https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX
Comment thread src/did/+did2/+convert/+migrators/treatment.m Fixed
Comment thread src/did/+did2/+convert/+migrators/treatment_transfer.m Fixed
Comment thread src/did/+did/document.m
if numel(matches)>0
notfound = 0;
d{i} = getfield(did_document_obj.document_properties.depends_on(matches(1)),'value');
d{i} = did.document.i_readDependencyTarget( ...
Comment thread src/did/+did2/+convert/+migrators/treatment_transfer.m Fixed
% Box the handle so the closure sees mutations, then null it out once
% we've explicitly closed (preventing a double-close in the cleanup hook).
handle = struct('id', dbid);
cleanup = onCleanup(@() closeIfOpen(handle)); %#ok<NASGU>
'Verbose', options.Verbose);

db = did2.database.sqlitedb(dstPath, 'SchemaCache', options.SchemaCache);
cleanup = onCleanup(@() db.close()); %#ok<NASGU>
error('did2:convert:readerFailed', ...
'Failed to open quarantine file "%s" for writing.', quarantineFile);
end
cleanup = onCleanup(@() fclose(fid)); %#ok<NASGU>
function [names, counts] = bumpClassCounter(names, counts, name)
    idx = find(strcmp(names, name), 1);
    if isempty(idx)
        names{end+1} = name; %#ok<AGROW>
        counts(end+1) = 1; %#ok<AGROW>
    end
desired = sort({obj.queryableScalarColumns.column});
current = obj.currentQueryableColumns();
if isequal(sort(current(:)'), desired(:)')
Fix the 'Run symmetry tests' failure (MATLAB:heterogeneousStrucAssignment
in did.util.compareDatabaseSummary>normalizeDeps): the comparator only
handled the legacy {name,value} depends_on shape and crashed on the
V_delta/V_epsilon {name,document_id} shape via a whole-struct assignment
between dissimilar structures. Rebuild a homogeneous {name,value} array
field-by-field, mapping document_id/value/id onto value so summaries
compare equal regardless of entry-key shape.
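
The field-by-field rebuild can be sketched as below. This is not the code from did.util.compareDatabaseSummary; the key priority (value, then document_id, then id) and the sample entries are assumptions.

```matlab
% Entries with dissimilar shapes, held in a cell array because MATLAB
% cannot store structs with different fields in one struct array.
deps = {struct('name', 'subject_id', 'document_id', 'abc'), ...
        struct('name', 'session',    'value',       'def'), ...
        struct('name', 'element',    'id',          'ghi')};

% Homogeneous target: every entry gets exactly {name, value}.
out = struct('name', {}, 'value', {});
for k = 1:numel(deps)
    e = deps{k};
    v = '';
    for key = {'value', 'document_id', 'id'}
        if isfield(e, key{1})
            v = e.(key{1});
            break;
        end
    end
    % Same fields in the same order, so this assignment is always legal.
    out(end+1) = struct('name', e.name, 'value', v); %#ok<SAGROW>
end
```

Copying named fields instead of assigning whole structs avoids the dissimilar-structures error regardless of which key the source entry used.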

Also clear two MATLAB Code Analyzer findings in the new migrators:
- treatment.m: classifyTreatment's unused ontology-node arg -> ~.
- treatment_transfer.m: replace deprecated datestr with datetime and drop
  the redundant pre-assignment in datenumToISO.
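
For reference, a datestr-free conversion along these lines might look like the sketch below. It assumes the datenum already represents UTC and that an ISO 8601 string with a literal Z suffix is wanted; the actual datenumToISO output format is not shown in this PR.

```matlab
% Serial date number for 2021-11-16 12:00:00 (assumed to be UTC already).
dn = datenum(2021, 11, 16, 12, 0, 0);

% datetime replaces datestr; 'ConvertFrom' interprets the serial number.
t = datetime(dn, 'ConvertFrom', 'datenum');

% Quote T and Z so they are emitted literally instead of being read as
% format specifiers.
iso = char(t, 'yyyy-MM-dd''T''HH:mm:ss''Z''');
```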

https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX
@stevevanhooser stevevanhooser changed the base branch from main to V2 June 12, 2026 00:17
claude added 3 commits June 12, 2026 00:35
Point the did2 test harness at the V_epsilon schema set (the PR's target)
and update the migrator unit tests whose outputs changed under the active
conversion:

- test-code.yml: DID_SCHEMA_PATH -> did-schema/schemas/V_epsilon (the
  set-version root, index mode), so did2 resolves the new draft classes
  (injection, bath, procedural_manipulation, ...).
- testCorpus20211116 / testCorpusPRED: resolveSchemaPath fallback ->
  V_epsilon root.
- testMigrators: rewrite the two legacy 'treatment passthrough' cases to
  the V_epsilon split (procedural_manipulation + unresolved->quarantine);
  schema_version stamp expectation -> V_epsilon.

Follow-up (next commits, converged against CI output): testConvertV1ToV2
schema_version assertions + stimulus_bath expected shape, and the
testCorpus* count/quarantine expectations.
…e 2)

- testConvertV1ToV2: flip the schema_version *default-stamp* assertions
  to V_epsilon (the active set version the pipeline now stamps) and stamp
  the already-target skeleton (makeVDeltaSkeleton -> makeTargetSkeleton)
  V_epsilon so the idempotency short-circuit fires. Leaves the
  intentional cases unchanged: the 'did_v1' preserve test and the
  stale-base-schema_version migration (value copied from base).
- testFromV1Database: switch the synthetic sample bodies off 'treatment'
  (now actively split/quarantined) to 'daqreader_ndr' (identity
  passthrough), so the orchestration tests (roundtrip/overwrite/
  quarantine sidecar) exercise the dispatcher mechanics, not treatment
  routing.
…o V_epsilon (phase 3)

- interactionCommon.concentration now populates a canonical sub-field
  (molar / grams_per_liter / mass_fraction / volume_fraction) when the
  source unit is recognised, restoring the fidelity the pre-rework
  stimulus_bath migrator had and extending it to injection/bath/
  treatment_drug mixtures. Unknown units stay source-only.
- stimulus_bath migrator: restore the missing-block error (malformed v1
  with no stimulus_bath block quarantines, matching the prior contract).
- testMigrators stimulus_bath cases: re-root assertions onto the
  V_epsilon blocks (bath.location, pharmacological_manipulation.mixture),
  keep the molar-scaling/unknown-unit expectations, and replace the
  empty-table 'zero-length' case with the required-non-empty backfill
  placeholder behavior.
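
The canonical sub-field behavior could be approximated as follows. The four sub-field names come from the commit message; the unit table, its scale factors, and the `source_value`/`source_unit` field names are hypothetical and may not match interactionCommon.concentration.

```matlab
% Hypothetical unit table: source unit -> canonical sub-field and scale.
unitTable = containers.Map();
unitTable('M')   = struct('field', 'molar', 'scale', 1);
unitTable('mM')  = struct('field', 'molar', 'scale', 1e-3);
unitTable('uM')  = struct('field', 'molar', 'scale', 1e-6);
unitTable('g/L') = struct('field', 'grams_per_liter', 'scale', 1);

known   = struct('source_value', 5, 'source_unit', 'mM');
unknown = struct('source_value', 2, 'source_unit', 'drops');

% Recognised unit: add the canonical sub-field next to the source values.
if isKey(unitTable, known.source_unit)
    u = unitTable(known.source_unit);
    known.(u.field) = known.source_value * u.scale;
end

% Unrecognised unit: the composite stays source-only.
if isKey(unitTable, unknown.source_unit)
    u = unitTable(unknown.source_unit);
    unknown.(u.field) = unknown.source_value * u.scale;
end
```

Keeping the source value and unit alongside the canonical field means nothing is lost when the unit is later added to the table.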

https://claude.ai/code/session_01F737KLMaAzotg1uXeZFQuX