feat: Recurse into nested BIDS datasets beyond derivatives/, and normalize invalid DatasetType#390
feat: Recurse into nested BIDS datasets beyond derivatives/, and normalize invalid DatasetType#390yarikoptic wants to merge 5 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #390 +/- ##
==========================================
- Coverage 90.45% 90.33% -0.13%
==========================================
Files 51 51
Lines 4316 4345 +29
Branches 721 726 +5
==========================================
+ Hits 3904 3925 +21
- Misses 323 328 +5
- Partials 89 92 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // spurious NOT_INCLUDED errors; fall back to the default as above. If | ||
| // the schema isn't loaded (test construction without a schema) we can't | ||
| // check — leave the value untouched. | ||
| // @ts-expect-error `metadata` is not declared on SchemaObjects; access is dynamic |
There was a problem hiding this comment.
let me know if I should reduce commenting here
There was a problem hiding this comment.
It's definitely a lot. I don't think we need the branching logic, though, validTypes.includes(undefined) will also return False.
// shallow copy to avoid polluting memoize cache
value = {...value}
const validTypes = this.schema?.objects?.metadata?.DatasetType?.enum as Array
// Unknown types (including capitalizations) fall back to defaults
// Per spec: 'the default value is `"raw"`'.
// Before DatasetType was defined, `GeneratedBy` was used as a derivatives indicator
if (validTypes && !validTypes.includes(value.DatasetType as string)) {
value.DatasetType = value.GeneratedBy ? 'derivative' : 'raw'
}
this.#dataset_description = valueI do wonder if we should drop the GeneratedBy behavior, since the spec does not say to do this, and GeneratedBy is now valid in raw datasets.
There was a problem hiding this comment.
yeah, I thought about that too! Moreover GeneratedBy is imho very applicable to "raw" datasets as they are typically generated by some tool + human(s) and thus we did provide it for "raw" (made it explicit!) BIDS dataset and I think frankly it should even be RECOMMENDED for "raw" (not study) : https://github.com/OpenNeuroDatasets/ds005256/blob/main/dataset_description.json#L37
There was a problem hiding this comment.
so let's kick it out in a separate PR?
There was a problem hiding this comment.
Sure, can be a separate PR.
There was a problem hiding this comment.
since we want this one to land and not wait -- just did right here.
| `sourcedata/ds00003/dataset_description.json`). The text output | ||
| section previously labeled "Derivative:" is now "Nested dataset:". | ||
| The `derivativesSummary` field on `ValidationResult` is preserved | ||
| for API stability but now covers all nested BIDS datasets. See |
There was a problem hiding this comment.
this made me pause -- should we separate out into sourcedatasetsSummary (would include rawbids/) from derivativesSummary? it would add a field but overall be "compatible" if anyone uses that, right @effigies ?
There was a problem hiding this comment.
decided to give claude a try for this, pushing... moving to draft - will need to test etc
Recursive validation (`-r`) previously recognized only datasets under `derivatives/<name>/`. Nested BIDS datasets placed under `rawbids/` (new study-level container, bids-standard/bids-specification#2191) or `sourcedata/` (historically used for `sourcedata/raw/` or OpenNeuro-style `sourcedata/ds00003/`) were silently ignored, so internal bugs in those subdatasets never surfaced. Generalize the detection: - For each of `derivatives/`, `rawbids/`, `sourcedata/`, check both forms: - Immediate: the container itself contains `dataset_description.json` (single nested dataset, e.g. `rawbids/dataset_description.json`). - Subfolder: one or more immediate children contain `dataset_description.json` (e.g. `derivatives/fmriprep/`, `sourcedata/ds00003/`). - Rename the local collection to `bidsNestedDatasets` / `nestedDatasetsSummary`; keep the public `derivativesSummary` field for API stability but widen its documented semantics. - Update the text output label "Derivative:" -> "Nested dataset:". Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `dataset_description` setter applied the default `DatasetType` only when the field was falsy (missing / empty). A truthy but out-of-enum value (e.g. `"Raw"` — capitalization typo) was kept as-is and then used to index `rules.directories[DatasetType]`, which returned `undefined`, causing the opaque-directory set to be empty and `findDirRuleMatches` to match no rule. That cascaded into spurious `NOT_INCLUDED` errors on legitimate `sub-*/ses-*` subjects inside such datasets, on top of the legitimate `JSON_SCHEMA_VALIDATION_ERROR` on the enum violation. Treat any `DatasetType` that is not one of the spec enum values the same as a missing value — fall back to `"raw"`, per the spec's backward-compatibility default. Source the valid enum from `schema.objects.metadata.DatasetType.enum` so this stays in sync with the spec; when the schema isn't loaded (test-only construction without a schema), leave a truthy value untouched rather than guess. Do the normalization on a shallow copy of the incoming `dataset_description` so we don't mutate the memoized `dataset_description.json` object. JSON schema validation still sees the original value and reports the enum violation. Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urcesSummary The prior commit folded every nested BIDS dataset — whether a derivative or a "source" under `rawbids/` / `sourcedata/` — into a single `derivativesSummary` field, with a caveat in the JSDoc that the name was retained for API stability only. That is confusing for consumers and hides useful structure: downstream tools often want to act on derivatives and sources differently (e.g. gating release on derivative errors alone, or surfacing source-dataset issues separately in a UI). Split the two into distinct fields: - `derivativesSummary` — datasets found under `derivatives/` (unchanged semantics from prior main releases). - `sourcesSummary` — datasets found under `rawbids/` or `sourcedata/` (new). Both are populated only when `-r` is set, and each maps the nested dataset's relative path to its own `ValidationResult`. `detectErrors` now descends into both. Text output renders the two groups as `Derivative:` and `Source:` sections. Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`consoleFormat` previously printed the root dataset's issues and summary first, followed by Derivative: and Source: sections. When many nested datasets were present, the root's status scrolled off the top of the terminal and a `tail` of the output left the user looking at the last nested dataset rather than the dataset they asked to validate. Reorder bottom-up so the root dataset comes last: - `Source: /sourcedata/...` first, then `Source: /rawbids/...` - `Derivative: /derivatives/...` - `Root dataset:` (newly labeled) — the parent's issues and Summary After the Root section, embed a `Nested datasets:` roll-up tabulating count, errors, and warnings for Sources and Derivatives, so a `tail` shows both the parent's status and an aggregate health view at a glance. Within Sources, `/sourcedata/*` keys come before `/rawbids/*` keys (alphabetical within each group). Within Derivatives, alphabetical. Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- detectErrors: assert errors in nested sourcesSummary are collected, mirroring the existing derivativesSummary case. - BIDSContextDataset opaqueDirectories: assert that an invalid DatasetType (e.g., "Raw"/"Derivative") is normalized via the schema's DatasetType.enum, falling back to derivative iff GeneratedBy is set. Co-Authored-By: Claude Code 2.1.126 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) noreply@anthropic.com
Summary
Two orthogonal-but-related fixes around nested BIDS datasets and the
DatasetTypefield:Recurse into nested BIDS datasets under
rawbids/andsourcedata/,not just
derivatives/. The previous recursion logic scanned onlyderivatives/<name>/dataset_description.json. It silently ignoredtwo other shapes that appear in real datasets:
(e.g.
rawbids/dataset_description.json, which was added as astudy-level container in
bids-standard/bids-specification#2191,
and the historical
sourcedata/raw/dataset_description.jsonform).datasets (e.g.
derivatives/fmriprep/,sourcedata/ds00003/dataset_description.jsonas used by OpenNeurostudy-style layouts).
With this change,
-rnow discovers and validates all of these.Bugs planted inside a
rawbids/orsourcedata/ds00003/subdatasetare caught; previously they went silently undetected.
Normalize an invalid
DatasetTypefor internal rule lookups.The setter previously defaulted only when
DatasetTypewas falsy; atruthy-but-out-of-enum value (e.g.
"Raw"— capitalization typo inbids-examples/proj-nipoppy/rawbids/dataset_description.json) maderules.directories[DatasetType]resolve toundefined, cascadinginto spurious
NOT_INCLUDEDerrors on legitimatesub-*/ses-*/subjects on top of the legitimate
JSON_SCHEMA_VALIDATION_ERROR.Fall back to the spec default for any out-of-enum value. Valid enum
is pulled from
schema.objects.metadata.DatasetType.enum(nohardcoded list). Normalization is done on a shallow copy so the
memoized JSON is untouched — JSON schema validation still flags the
enum violation exactly once.
User-visible behavior changes
-rnow recurses into nested BIDS datasets found under any ofderivatives/,rawbids/, orsourcedata/, at either the container'sown level or one level down.
"Derivative:"is now"Nested dataset:".derivativesSummaryfield onValidationResultis preserved forAPI stability but now covers all nested BIDS datasets (not only
derivatives). Documented in the
validateJSDoc.DatasetTypestill get theJSON_SCHEMA_VALIDATION_ERRORthey already got, but no longercascade into spurious
NOT_INCLUDEDon legitimate subjects.Not in this PR (flagged for follow-up)
raw(orderivative) rootcontaining a nested BIDS dataset under
rawbids/orsourcedata/,and a non-
studyroot containingrawbids/at all. Likely best pairedwith an upstream schema proposal to declare nested-BIDS capability on
directory rules (e.g.
nested_bids: single | multiple | none), sothe validator stops hardcoding the container set.
(@yarikoptic has even missed that and needs to get back to that PR to see if we discussed this or not to be also part of 'derivatives' dataset type!)
filememoize(src/utils/memoize.ts) keys by(file.parent.path, file.path), which collides across rerooted trees.When multiple nested BIDS datasets are validated in parallel under
-r, whicheverloadJSONruns first wins and others reuse itscontent. Affects multi-derivative recursion too. Tracked separately.
Test plan
deno test --allow-env --allow-read --allow-write --allow-run --allow-net src/— 336/336 pass.DatasetTypecase:-rsurfaces the planted bug insiderawbids/sub-002/bogus_file.txt.-rsurfaces the planted bug insidesourcedata/ds00003/sub-A/stray.dat.JSON_SCHEMA_VALIDATION_ERRORstill fires forDatasetType: "Raw".NOT_INCLUDEDon legitimatesub-001.src/tests/local/derivatives.test.ts).