Skip to content

feat: Recurse into nested BIDS datasets beyond derivatives/, and normalize invalid DatasetType#390

Draft
yarikoptic wants to merge 5 commits into
bids-standard:mainfrom
yarikoptic:bf-recurse
Draft

feat: Recurse into nested BIDS datasets beyond derivatives/, and normalize invalid DatasetType#390
yarikoptic wants to merge 5 commits into
bids-standard:mainfrom
yarikoptic:bf-recurse

Conversation

@yarikoptic
Copy link
Copy Markdown
Contributor

Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) noreply@anthropic.com

Summary

Two orthogonal-but-related fixes around nested BIDS datasets and the
DatasetType field:

  1. Recurse into nested BIDS datasets under rawbids/ and sourcedata/,
    not just derivatives/
    . The previous recursion logic scanned only
    derivatives/<name>/dataset_description.json. It silently ignored
    two other shapes that appear in real datasets:

    • Single nested: the container directory itself is a BIDS dataset
      (e.g. rawbids/dataset_description.json, which was added as a
      study-level container in
      bids-standard/bids-specification#2191,
      and the historical sourcedata/raw/dataset_description.json form).
    • Multiple nested: sub-directories of the container are each BIDS
      datasets (e.g. derivatives/fmriprep/,
      sourcedata/ds00003/dataset_description.json as used by OpenNeuro
      study-style layouts).

    With this change, -r now discovers and validates all of these.
    Bugs planted inside a rawbids/ or sourcedata/ds00003/ subdataset
    are caught; previously they went silently undetected.

  2. Normalize an invalid DatasetType for internal rule lookups.
    The setter previously defaulted only when DatasetType was falsy; a
    truthy-but-out-of-enum value (e.g. "Raw" — capitalization typo in
    bids-examples/proj-nipoppy/rawbids/dataset_description.json) made
    rules.directories[DatasetType] resolve to undefined, cascading
    into spurious NOT_INCLUDED errors on legitimate sub-*/ses-*/
    subjects on top of the legitimate JSON_SCHEMA_VALIDATION_ERROR.

    Fall back to the spec default for any out-of-enum value. Valid enum
    is pulled from schema.objects.metadata.DatasetType.enum (no
    hardcoded list). Normalization is done on a shallow copy so the
    memoized JSON is untouched — JSON schema validation still flags the
    enum violation exactly once.

User-visible behavior changes

  • -r now recurses into nested BIDS datasets found under any of
    derivatives/, rawbids/, or sourcedata/, at either the container's
    own level or one level down.
  • Text output label "Derivative:" is now "Nested dataset:".
  • The derivativesSummary field on ValidationResult is preserved for
    API stability but now covers all nested BIDS datasets (not only
    derivatives). Documented in the validate JSDoc.
  • Datasets with an out-of-enum DatasetType still get the
    JSON_SCHEMA_VALIDATION_ERROR they already got, but no longer
    cascade into spurious NOT_INCLUDED on legitimate subjects.

Not in this PR (flagged for follow-up)

  • Compliance checks currently unenforced: a raw (or derivative) root
    containing a nested BIDS dataset under rawbids/ or sourcedata/,
    and a non-study root containing rawbids/ at all. Likely best paired
    with an upstream schema proposal to declare nested-BIDS capability on
    directory rules (e.g. nested_bids: single | multiple | none), so
    the validator stops hardcoding the container set.
    (@yarikoptic has even missed that and needs to get back to that PR to see if we discussed this or not to be also part of 'derivatives' dataset type!)
  • Pre-existing: filememoize (src/utils/memoize.ts) keys by
    (file.parent.path, file.path), which collides across rerooted trees.
    When multiple nested BIDS datasets are validated in parallel under
    -r, whichever loadJSON runs first wins and others reuse its
    content. Affects multi-derivative recursion too. Tracked separately.

Test plan

  • deno test --allow-env --allow-read --allow-write --allow-run --allow-net src/ — 336/336 pass.
  • Synthetic dataset reproducing the nested-rawbids + invalid-DatasetType case:
    • -r surfaces the planted bug inside rawbids/sub-002/bogus_file.txt.
    • -r surfaces the planted bug inside sourcedata/ds00003/sub-A/stray.dat.
    • JSON_SCHEMA_VALIDATION_ERROR still fires for DatasetType: "Raw".
    • No spurious NOT_INCLUDED on legitimate sub-001.
  • Existing derivatives recursion behavior unchanged (covered by
    src/tests/local/derivatives.test.ts).

@yarikoptic yarikoptic requested review from effigies and rwblair April 21, 2026 17:31
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.33%. Comparing base (5c0b1cd) to head (62fed66).

Files with missing lines Patch % Lines
src/validators/bids.ts 86.11% 3 Missing and 2 partials ⚠️
src/schema/context.ts 70.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #390      +/-   ##
==========================================
- Coverage   90.45%   90.33%   -0.13%     
==========================================
  Files          51       51              
  Lines        4316     4345      +29     
  Branches      721      726       +5     
==========================================
+ Hits         3904     3925      +21     
- Misses        323      328       +5     
- Partials       89       92       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/schema/context.ts
// spurious NOT_INCLUDED errors; fall back to the default as above. If
// the schema isn't loaded (test construction without a schema) we can't
// check — leave the value untouched.
// @ts-expect-error `metadata` is not declared on SchemaObjects; access is dynamic
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if I should reduce commenting here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely a lot. I don't think we need the branching logic, though, validTypes.includes(undefined) will also return False.

// shallow copy to avoid polluting memoize cache
value = {...value}
const validTypes = this.schema?.objects?.metadata?.DatasetType?.enum as Array
// Unknown types (including capitalizations) fall back to defaults
// Per spec: 'the default value is `"raw"`'.
// Before DatasetType was defined, `GeneratedBy` was used as a derivatives indicator
if (validTypes && !validTypes.includes(value.DatasetType as string)) {
  value.DatasetType = value.GeneratedBy ? 'derivative' : 'raw'
}
this.#dataset_description = value

I do wonder if we should drop the GeneratedBy behavior, since the spec does not say to do this, and GeneratedBy is now valid in raw datasets.

Copy link
Copy Markdown
Contributor Author

@yarikoptic yarikoptic Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I thought about that too! Moreover GeneratedBy is imho very applicable to "raw" datasets as they are typically generated by some tool + human(s) and thus we did provide it for "raw" (made it explicit!) BIDS dataset and I think frankly it should even be RECOMMENDED for "raw" (not study) : https://github.com/OpenNeuroDatasets/ds005256/blob/main/dataset_description.json#L37

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so let's kick it out in a separate PR?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, can be a separate PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we want this one to land and not wait -- just did right here.

Comment thread src/validators/bids.ts Outdated
Comment thread src/validators/bids.ts
`sourcedata/ds00003/dataset_description.json`). The text output
section previously labeled "Derivative:" is now "Nested dataset:".
The `derivativesSummary` field on `ValidationResult` is preserved
for API stability but now covers all nested BIDS datasets. See
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this made me pause -- should we separate out into sourcedatasetsSummary (would include rawbids/) from derivativesSummary? it would add a field but overall be "compatible" if anyone uses that, right @effigies ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decided to give claude a try for this, pushing... moving to draft - will need to test etc

yarikoptic and others added 3 commits April 23, 2026 09:14
Recursive validation (`-r`) previously recognized only datasets under
`derivatives/<name>/`.  Nested BIDS datasets placed under `rawbids/`
(new study-level container, bids-standard/bids-specification#2191) or
`sourcedata/` (historically used for `sourcedata/raw/` or
OpenNeuro-style `sourcedata/ds00003/`) were silently ignored, so
internal bugs in those subdatasets never surfaced.

Generalize the detection:

- For each of `derivatives/`, `rawbids/`, `sourcedata/`, check both
  forms:
  - Immediate: the container itself contains `dataset_description.json`
    (single nested dataset, e.g. `rawbids/dataset_description.json`).
  - Subfolder: one or more immediate children contain
    `dataset_description.json` (e.g. `derivatives/fmriprep/`,
    `sourcedata/ds00003/`).
- Rename the local collection to `bidsNestedDatasets` /
  `nestedDatasetsSummary`; keep the public `derivativesSummary` field
  for API stability but widen its documented semantics.
- Update the text output label "Derivative:" -> "Nested dataset:".

Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `dataset_description` setter applied the default `DatasetType`
only when the field was falsy (missing / empty).  A truthy but
out-of-enum value (e.g. `"Raw"` — capitalization typo) was kept
as-is and then used to index `rules.directories[DatasetType]`,
which returned `undefined`, causing the opaque-directory set to be
empty and `findDirRuleMatches` to match no rule.  That cascaded
into spurious `NOT_INCLUDED` errors on legitimate `sub-*/ses-*`
subjects inside such datasets, on top of the legitimate
`JSON_SCHEMA_VALIDATION_ERROR` on the enum violation.

Treat any `DatasetType` that is not one of the spec enum values
the same as a missing value — fall back to `"raw"`, per the spec's
backward-compatibility default.  Source the valid enum from
`schema.objects.metadata.DatasetType.enum` so this stays in sync
with the spec; when the schema isn't loaded (test-only
construction without a schema), leave a truthy value untouched
rather than guess.

Do the normalization on a shallow copy of the incoming
`dataset_description` so we don't mutate the memoized
`dataset_description.json` object.  JSON schema validation still
sees the original value and reports the enum violation.

Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urcesSummary

The prior commit folded every nested BIDS dataset — whether a
derivative or a "source" under `rawbids/` / `sourcedata/` — into a
single `derivativesSummary` field, with a caveat in the JSDoc that
the name was retained for API stability only.  That is confusing
for consumers and hides useful structure: downstream tools often
want to act on derivatives and sources differently (e.g. gating
release on derivative errors alone, or surfacing source-dataset
issues separately in a UI).

Split the two into distinct fields:

- `derivativesSummary` — datasets found under `derivatives/`
  (unchanged semantics from prior main releases).
- `sourcesSummary` — datasets found under `rawbids/` or
  `sourcedata/` (new).

Both are populated only when `-r` is set, and each maps the nested
dataset's relative path to its own `ValidationResult`.
`detectErrors` now descends into both.

Text output renders the two groups as `Derivative:` and `Source:`
sections.

Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yarikoptic yarikoptic marked this pull request as draft April 24, 2026 03:31
yarikoptic and others added 2 commits April 27, 2026 18:02
`consoleFormat` previously printed the root dataset's issues and
summary first, followed by Derivative: and Source: sections.  When
many nested datasets were present, the root's status scrolled off
the top of the terminal and a `tail` of the output left the user
looking at the last nested dataset rather than the dataset they
asked to validate.

Reorder bottom-up so the root dataset comes last:

- `Source: /sourcedata/...` first, then `Source: /rawbids/...`
- `Derivative: /derivatives/...`
- `Root dataset:` (newly labeled) — the parent's issues and Summary

After the Root section, embed a `Nested datasets:` roll-up
tabulating count, errors, and warnings for Sources and
Derivatives, so a `tail` shows both the parent's status and an
aggregate health view at a glance.

Within Sources, `/sourcedata/*` keys come before `/rawbids/*` keys
(alphabetical within each group).  Within Derivatives, alphabetical.

Co-Authored-By: Claude Code 2.1.116 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- detectErrors: assert errors in nested sourcesSummary are collected,
  mirroring the existing derivativesSummary case.
- BIDSContextDataset opaqueDirectories: assert that an invalid
  DatasetType (e.g., "Raw"/"Derivative") is normalized via the schema's
  DatasetType.enum, falling back to derivative iff GeneratedBy is set.

Co-Authored-By: Claude Code 2.1.126 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants