Add SmartDiskCache module with hash-based persistent caching #49

Open: BitcrushedHeart wants to merge 45 commits into Nerogar:master from BitcrushedHeart:SmartCache
BitcrushedHeart (Contributor) commented Apr 6, 2026

SmartDiskCache - Hash-Based Persistent Caching

What This Is

A replacement for 'DiskCache' that makes caching persistent and content-addressed rather than ephemeral. Adding one image to a 100k dataset caches one file, not 100k. Editing one caption recaches one text embedding, not all of them. Moving files between concepts (same content, different path) reuses existing cache via hash matching. Switching between training configs that differ only in non-cache-relevant settings never triggers recaching.

The cache becomes a content-addressed store that grows over time and only rebuilds what's genuinely stale.

How It Works

Hashing

Every source file gets an xxhash64 hash of its contents. xxhash64 is faster than MD5/SHA-256 and has excellent collision resistance for non-cryptographic purposes. The full 64-bit hash is used internally for comparison. Cache filenames use a 12 hex char truncation (48 bits, ~281 trillion possible values) to keep paths manageable.

Image cache files: '{hash12}{resolution}{variation}.pt'
Text cache files: '{hash12}_{variation}.pt'
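
A minimal sketch of the hashing and naming scheme above, not the module's actual code. It assumes the third-party 'xxhash' package when it is installed and falls back to a 64-bit BLAKE2b digest from the standard library purely so the example runs anywhere. 'hash_file' and 'text_cache_name' are hypothetical helper names.

```python
import hashlib

try:
    import xxhash  # third-party; assumed here to provide the xxhash64 digest

    def _new_hasher():
        return xxhash.xxh64()
except ImportError:
    # Stand-in only: a 64-bit BLAKE2b digest, NOT xxhash64.
    def _new_hasher():
        return hashlib.blake2b(digest_size=8)


def hash_file(path, chunk_size=1 << 20):
    """Return a 16-hex-char (64-bit) content hash, reading the file in chunks."""
    hasher = _new_hasher()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def text_cache_name(full_hash, variation):
    """Build '{hash12}_{variation}.pt' from the full 64-bit hash (hypothetical helper)."""
    return f"{full_hash[:12]}_{variation}.pt"
```

The full 16-character digest is what would be compared; only the filename uses the 12-character prefix.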

Validation Flow

Per-file validation runs for each file needed in the current epoch:

  1. EXIST: Does this file have a cache entry for the current modeltype? If not, hash it and check for dedup (same content elsewhere), or build new cache.
  2. EXIST: Does the expected '.pt' file exist on disk? If not, rebuild.
  3. MTIME: Has mtime changed since the cache entry was written? If not, accept (fast path - most files won't have changed).
  4. HASH: Recalculate xxhash64. If hash unchanged (file touched/copied but content identical), accept and update mtime. If hash changed, rebuild.

The mtime check is the fast path. Hash computation only happens when mtime changes. This means validation of a 100k dataset where nothing changed is essentially free - it's 100k 'stat()' calls, no file reads.
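
The flow above can be sketched roughly as follows. This is an illustration under assumptions: 'validate_entry', the entry dict layout ('mtime', 'hash', 'cache_file'), and the returned status strings are hypothetical and may not match the module.

```python
import os


def validate_entry(path, entry, cache_dir, hash_fn):
    """Return 'missing', 'missing_pt', 'valid', or 'content_changed'.

    `entry` is a hypothetical index record; `hash_fn(path)` returns a content hash.
    """
    if entry is None:
        return "missing"  # no index entry: caller hashes, dedups, or builds
    if not os.path.isfile(os.path.join(cache_dir, entry["cache_file"])):
        return "missing_pt"  # index entry exists but the .pt file is gone
    current_mtime = os.stat(path).st_mtime
    if current_mtime == entry["mtime"]:
        return "valid"  # fast path: one stat(), no file read
    if hash_fn(path) == entry["hash"]:
        entry["mtime"] = current_mtime  # touched or copied, content identical
        return "valid"
    return "content_changed"
```

Only the last two branches read the file, which is why an unchanged dataset costs one 'stat()' per file.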

Cache Index

Each cache directory ('image/' and 'text/') maintains a 'cache.json' index with per-file entries (filename, hash, mtime, modeltype, resolution, cache_file, cache_version) and a 'hash_index' mapping hashes to lists of filepaths for dedup lookups. The index uses atomic writes (write to '.tmp', backup to '.bak', rename) with crash recovery on startup.
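
One way the atomic write and startup recovery could look, as a sketch only. 'save_index' and 'load_index' are hypothetical names, and the recovery order (main file, then '.tmp', then '.bak') is an assumption for illustration.

```python
import json
import os
import shutil


def save_index(index, path):
    """Write to '.tmp', back up the current file to '.bak', then rename."""
    tmp, bak = path + ".tmp", path + ".bak"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(index, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
    if os.path.exists(path):
        shutil.copy2(path, bak)
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_index(path):
    """Return the first candidate that parses; {} if none do."""
    for candidate in (path, path + ".tmp", path + ".bak"):
        try:
            with open(candidate, encoding="utf-8") as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            continue  # missing or truncated: try the next candidate
    return {}
```

A crash between any two steps leaves at least one parseable file behind, which is what the loader relies on.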

Deduplication

When a new file is encountered, its hash is checked against the 'hash_index'. If a match exists with the same modeltype and resolution, the existing cache entry is reused - no encoding needed. This handles the common case of the same image appearing in multiple concepts.

When one copy of a deduplicated file is edited, it gets a new hash and new cache files. The unedited copy still points to the old cache entry. When all references to a hash are gone, the cache files become eligible for garbage collection.
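
The reference bookkeeping described above might look like this. It is a sketch with a hypothetical index layout ('entries' plus 'hash_index') and hypothetical function names; 'build_fn' stands in for the real encoding step.

```python
def register(index, filepath, content_hash, modeltype, resolution, build_fn):
    """Reuse an existing cache file for identical content, else build one."""
    cache_file = None
    for other in index["hash_index"].get(content_hash, []):
        entry = index["entries"][other]
        if entry["modeltype"] == modeltype and entry["resolution"] == resolution:
            cache_file = entry["cache_file"]  # dedup hit: no encoding needed
            break
    if cache_file is None:
        cache_file = build_fn(filepath)
    index["entries"][filepath] = {
        "hash": content_hash,
        "modeltype": modeltype,
        "resolution": resolution,
        "cache_file": cache_file,
    }
    refs = index["hash_index"].setdefault(content_hash, [])
    if filepath not in refs:
        refs.append(filepath)
    return cache_file


def unregister(index, filepath):
    """Drop one reference; return True if its cache file is now orphaned."""
    entry = index["entries"].pop(filepath)
    refs = index["hash_index"][entry["hash"]]
    refs.remove(filepath)
    if not refs:
        del index["hash_index"][entry["hash"]]
    still_used = any(
        e["cache_file"] == entry["cache_file"] for e in index["entries"].values()
    )
    return not still_used
```

An edit to one copy would be an 'unregister' of the old hash followed by a 'register' of the new one, leaving the unedited copy's entry untouched.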

Sourceless Training

If all necessary training data is embedded in the '.pt' cache files, users can train from cache alone without the source images/text files. A 'sourceless_training' toggle in the config enables this. When active, the dataloader skips file enumeration, loading, and augmentation modules entirely - the pipeline collapses to just '[cache_modules, output_modules]'.

On startup in sourceless mode, 'SmartDiskCache' validates that all cache entries have sufficient 'cache_version', correct 'modeltype', and existing '.pt' files. Clear errors are raised if anything is missing.

This enables dataset sharing without distributing original files. Cached latents can't be decoded back to pixel-space images without the VAE decoder, so this is a one-way transform - useful for privacy-sensitive datasets.

Garbage Collection

A "Clean Cache" button in the UI identifies orphaned cache files (source file no longer exists, or '.pt' files with no 'cache.json' entry) and shows a preview with file counts and sizes before deleting anything. Dedup-shared '.pt' files are preserved as long as at least one source file still references them.

Sample Selection Fix

The SAMPLES balancing strategy now shuffles the full file pool then takes N, rather than taking the first N then shuffling. This gives genuinely random sampling across epochs when using large datasets with sample limits.
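
The difference between the two orderings, as a small sketch. Both function names are hypothetical; 'rng' is a 'random.Random' instance seeded per epoch.

```python
import random


def select_samples(files, n, rng):
    """New behaviour: shuffle the whole pool, then take the first n."""
    pool = list(files)
    rng.shuffle(pool)
    return pool[:n]


def select_samples_old(files, n, rng):
    """Previous behaviour: take the first n, then shuffle only those."""
    pool = list(files)[:n]
    rng.shuffle(pool)
    return pool
```

With the old ordering every epoch sees the same n files in a different order; with the new one each epoch can draw from the entire pool.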

What Changed

New File

'src/mgds/pipelineModules/SmartDiskCache.py' - the entire module. It combines 'PipelineModule' and 'SingleVariationRandomAccessPipelineModule' and is a drop-in replacement for 'DiskCache' with additional constructor params ('modeltype', 'source_path_in_name', 'sourceless').

Testing

Test branch: 'SmartcacheTests' - 69 tests covering hashing, cache validation flow, deduplication, atomic writes/crash recovery, garbage collection, sourceless training, sample selection, DiskCache regression, and issue regression scenarios.

Why not replace DiskCache?

While mgds is built for OneTrainer, I have no idea what else could be using it. Keeping both modules lets existing repos continue using DiskCache even as OneTrainer shifts to SmartDiskCache. If desired, we could raise a deprecation warning when DiskCache is used once this is merged.


Closes #41

Introduces SmartDiskCache as a drop-in replacement for DiskCache with
per-file xxhash64 content validation, content-addressed cache filenames,
a cache.json index with deduplication support, atomic writes with crash
recovery, garbage collection, sourceless training mode, and a sample
selection fix for the SAMPLES balancing strategy.

- rebuild validation status now cleans hash_index before re-queuing,
  matching the behavior of content_changed/resolution_changed/missing_pt
- Remove unused all_input_files set from __refresh_cache

- Store loss_weight, type, name, path, seed from concept dict in .pt
  files at build time (follows existing __cache_version pattern)
- In sourceless mode, reconstruct concept dict from stored metadata
  so OutputPipelineModule can resolve concept.loss_weight
- Add concept to sourceless get_outputs() so pipeline resolution
  finds SmartDiskCache instead of walking back to ConceptPipelineModule
- Bump CACHE_VERSION to 2 (forces cache rebuild for sourceless mode,
  normal mode unaffected)

Call before_cache_fun before falling through to upstream pipeline
modules in get_item, so the model is on the correct device when
re-encoding uncached items at training time.

The real bug was in OneTrainer passing 'prompt_path' (nonexistent)
as source_path_in_name for the text cache, causing every text lookup
to miss. With the correct key ('image_path'), the fallback path
should never be reached after a fresh cache build.

- Add .pt existence check on mtime fast-path to prevent FileNotFoundError
- Replace shutil.move with os.replace for atomic writes on Windows
- Rewrite _load_cache_index with 3-stage fallback (cache.json → .tmp → .bak)
- Extend _index_lock to cover full save operation (write + backup + rename)
- Switch to time-based flush interval (30s) with compact JSON for intermediate flushes
- Cache os.path.realpath once in __init__, use _real_pt_path consistently
- Cache source paths at epoch start, eliminate per-item pipeline traversal
- Load aggregate data into RAM at epoch start, serve from memory in get_item

Shows tqdm progress during the validation loop and aggregate cache
loading so the terminal doesn't appear frozen between phases.

The generator expression caused as_completed to submit futures lazily,
one at a time, preventing the executor from pipelining the next item
while the current one's I/O completes.

BitcrushedHeart and others added 16 commits April 6, 2026 16:20

On repeat runs where nothing changed, cache validation was taking 20+
minutes due to stat-ing every source file individually. This adds a
fast path that checks directory mtimes and spot-checks a sample of
entries, reducing validation to under a second for unchanged datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Cache validation was running at the start of every epoch, even when the
same filepaths were being delivered (which is the common case since users
configure repeats rather than custom samples_per_epoch). On larger
datasets the per-file validation loop was noticeable at each epoch
boundary despite no actual dataset change.

Track validated filepaths in a per-process set and short-circuit
_reshuffle_and_prepare when every required path is already in that set
and still present in the on-disk index. Fall through to the existing
fast-validate / full-validate paths otherwise.

Trade-off: within-run edits to source files are no longer detected.
Cross-run detection (via cache.json + fast validation) is unchanged.
Training against a mutating dataset within a single process was never
well-defined anyway.

Fix device mismatch on cache miss during training. Call
before_cache_fun before falling through to upstream pipeline
modules in get_item, so the model is on the correct device
when re-encoding uncached items at training time.

The fallback is reachable whenever individual files fail to
cache (build_failed / missing / hash_failed), so the band-aid
from c22be2f was removed prematurely in 28795b1.

Persist a zero-tensor sentinel during cache validation using any
successful entry as a shape template. On cache miss, return the
sentinel directly instead of re-running upstream encoders.

Rationale: files that fail to cache (build_failed / missing /
hash_failed) leave gaps in the index. At training time the text
encoder is on the temp device (CPU) and bringing it back to GPU
to re-encode a single sample risks both a device mismatch and an
OOM since the main model is already on GPU.

The before_cache_fun re-encode path is kept as a last-resort
fallback for the edge case where no valid entries exist yet
(e.g. caching interrupted before any file succeeded).

When the env var is set, skip per-file mtime/hash/.pt-existence checks
and the upstream _get_resolution_string call (which can trigger per-image
I/O on slow cloud storage). Filepaths already in the on-disk index are
trusted; only missing filepaths are cached. Modeltype mismatch still
raises to prevent silent cross-model cache reuse.

Driven by the --skip-cache-validation CLI flag in OneTrainer/scripts/train.py.

Toggling settings like masked_training between runs adds keys (e.g.
'latent_mask') to split_names/aggregate_names that aren't present in
existing .pt files, so downstream readers (AspectBatchSorting et al)
crashed with KeyError instead of silently dropping the missing field.

Stamp split+aggregate names into cache.json as 'schema' so we can
detect drift on startup. When drift is found, walk every entry, run
only the missing names through the upstream pipeline, and merge them
into the existing .pt (preserving all other keys). Atomic via tmp +
os.replace, parallelised through the existing executor.

_ensure_blank_sentinel now rebuilds when the sentinel doesn't cover
all currently-required keys, and get_item borrows zero-tensors from
the sentinel for any key still missing from a per-file augmentation
failure -- no single bad entry can crash training.

Previous augment-in-place fix re-ran _get_previous_item('latent_mask',
in_index) through the upstream pipeline to backfill missing keys, then
wrote them into the existing .pt next to the already-cached
latent_image. That breaks when toggling settings adds modules to the
upstream chain (e.g. enabling masked_training pulls in
mask_augmentation_modules and changes 'mask' to be cropped alongside
'image'), which can produce a different crop_resolution than the one
stored in the cache. Result: latent_mask written at a shape that
doesn't match the cached latent_image, then collate_fn crashes with
'stack expects each tensor to be equal size' once a batch mixes
samples whose mask shapes diverged.

Switch to invalidate-and-rebuild: when schema drift is detected, drop
every entry from the index, delete the .pt files, and let the
existing build loop rebuild each entry in a single upstream pass so
all keys share the same crop_resolution and shape.

Add a SCHEMA_METHOD marker stamped into cache.json. Caches that were
schema-stamped by the prior augment-based code (schema set,
schema_method unset) are auto-invalidated on the next run so users
who already trained on shape-corrupted .pt files get a clean rebuild
without manually nuking their cache_dir.

Augmenting a cache built under different settings (e.g. masked_training
toggled, which adds mask_augmentation modules to the upstream chain)
re-runs the upstream pipeline for the missing names. The fresh run
can produce a different crop_resolution than the one stored alongside
latent_image, so the augmented latent_mask ends up at a different
spatial shape -- collate_fn then crashes with 'stack expects each
tensor to be equal size' once a batch mixes samples whose mask shapes
diverged.

Fix at the source:
- Per cached entry, derive a reference spatial shape from the
  already-cached latent_image.
- For every target name, recompute via _get_previous_item only when
  the cached value is missing OR its spatial shape mismatches the
  reference. Names that already match are left untouched.
- Force the recomputed value onto the reference shape via bilinear
  interpolation when upstream returns something divergent. The mask
  is approximate when the cache crosses pipelines, but it's much
  cheaper than rebuilding 100k entries from scratch.

Stamp a SCHEMA_METHOD marker into cache.json. Caches stamped by the
prior augment that didn't shape-check (schema set, schema_method
unset/different) are auto re-augmented on the next run, fixing the
already-broken on-disk values without manual cache_dir cleanup.

Pure refactor, no behavior change.

- I001: sort the import block (mgds-internal imports go in the first-party
  section, per the config used to lint).
- UP035: import Callable from collections.abc instead of typing.
- UP008: drop redundant super() arguments.
- SIM105: replace try/except/pass with contextlib.suppress.
- SIM118: drop .keys() in 'in dict' membership checks.
- SIM108: collapse if/else into a ternary where it fits.
- SIM113: fold a manual build_count into enumerate(start=1).
- C416: rewrite a list comprehension as list().
- RET503: add an explicit return None on the no-match path.
- RSE102: drop the empty parentheses on raise.
- B007: drop the unused fp/i loop variables.

… dir

SmartDiskCache validation regressed dramatically vs the old DiskCache: a
30k-image cache validated in ~40 minutes instead of ~10. Each entry was
firing 1 getmtime + V isfile syscalls plus two pipeline traversals, all
serial. Under Windows Defender / EDR filters this slowed to 4 it/s.

Five bundled changes reduce a fresh-pipeline validation pass on 30k
images from minutes to seconds:

1. _scan_existing_pt_files(): one os.scandir of the cache dir replaces
   N×V os.path.isfile calls during validation, dedup, and build.
2. _bulk_stat_source_files(): parallel os.scandir per source parent
   dir via the existing executor; harvests mtimes in K syscalls
   (K = #parent dirs) instead of N getmtime calls.
3. Validation loop iterates unique in_index once instead of
   needed_variations × N. _validate_entry is invariant in in_variation;
   the build phase still iterates all V internally.
4. Resolution short-circuit: _get_resolution_string is only called
   when an entry is missing or invalidated, not on every cache hit.
5. Per-watched-file directory fingerprint: replaces the parent-dir
   mtime check in _fast_validate. Touching an unrelated sidecar file
   (caption .txt, mask, .npz) in a watched dir no longer invalidates
   the fast path. Stored as cache_index['watched_fingerprints'];
   legacy caches without the field run one full validation pass to
   write it, then take the fast path on subsequent runs.
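
Change 2 above could be sketched like this. It is an illustration, not the module's _bulk_stat_source_files: the name bulk_stat and its return shape are hypothetical, and the executor is any concurrent.futures executor.

```python
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def bulk_stat(filepaths, executor):
    """Return {filepath: mtime or None}, scanning each parent dir once."""
    by_parent = defaultdict(set)
    for fp in filepaths:
        by_parent[os.path.dirname(fp)].add(os.path.basename(fp))

    def scan(parent):
        wanted = by_parent[parent]
        found = {}
        try:
            with os.scandir(parent or ".") as entries:
                for de in entries:
                    if de.name in wanted:
                        # DirEntry.stat() is served from the scan on Windows;
                        # on POSIX it may cost one stat call per matched entry.
                        found[os.path.join(parent, de.name)] = de.stat().st_mtime
        except OSError:
            pass  # unreadable directory: its files are reported as missing
        return found

    mtimes = dict.fromkeys(filepaths)  # every path starts as None (missing)
    for found in executor.map(scan, list(by_parent)):
        mtimes.update(found)
    return mtimes
```

Usage would be along the lines of `bulk_stat(paths, ThreadPoolExecutor(max_workers=8))`, with one scan task per parent directory.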

Also fixes pre-existing ruff violations in tests/test_smartcache.py
(import sort, unused vars, set comprehensions, zip strict=).

Tests: 15 new behaviour-parity tests (TestBulkScanCorrectness,
TestBulkStatCorrectness, TestResolutionShortCircuit, TestVariationDedup,
TestWatchedFingerprint) plus 3 timing benchmarks. Headline benchmark
on this machine: 200-file cold validation 2.5s, fresh-pipeline warm
fast-validate 177ms, full validation after one file touch 177ms.
Existing tests unchanged; one (test_rebuild_cleans_hash_index) updated
to drive the same code path through file-content change rather than
patching os.path.getmtime which the bulk-stat path no longer uses.

Pre-existing GC tests (test_gc_preview_empty, test_gc_clean) still
fail; the blank_sentinel.pt orphan is created by an unrelated upstream
module and is out of scope for this commit.

The validation loop was still calling _get_resolution_string for every
valid cache entry, which chains AspectBucketing -> CalcAspect ->
LoadImage and opens the source image to read its dimensions. On a 33k
dataset this was the dominant remaining cost (~5 it/s, ~hour-and-a-half
total) even after the bulk-scan fixes — the per-image decode dwarfed
the syscalls we'd already eliminated.

Trust the cached resolution on the happy path. Same contract as the
original DiskCache: bucket config changes require a manual cache clear.
schema_method drift is detected earlier in __refresh_cache via
_detect_cache_schema_drift / _augment_cache_with_missing_names and
remains intact.

The rebuild branch still calls _get_resolution_string for files that
genuinely need rebuilding, which is correct and small in the steady
state.

CACHE_VERSION 2 -> 3. Each entry now stores a ``variants`` dict keyed
by resolution string (e.g. ``"896x640"``) instead of single
``cache_file``/``resolution`` fields. v2 indices migrate in place on
load — no .pt rebuild required.
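
A rough sketch of what an in-place v2 -> v3 entry migration could look like. The function name, the inner shape of each variant record, and the ``"none"`` placeholder key for entries without a resolution are all assumptions made for illustration.

```python
def migrate_entry_v2_to_v3(entry):
    """Move v2 'cache_file'/'resolution' fields under a 'variants' dict."""
    if "variants" in entry:
        return entry  # already v3, nothing to do
    migrated = {
        k: v for k, v in entry.items() if k not in ("cache_file", "resolution")
    }
    # Hypothetical placeholder key for entries that never had a resolution.
    key = entry.get("resolution") or "none"
    migrated["variants"] = {key: {"cache_file": entry["cache_file"]}}
    migrated["cache_version"] = 3
    return migrated
```

Only index metadata moves; the .pt file named by ``cache_file`` is referenced as-is, which is why no rebuild is needed.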

When AspectBucketing config changes between runs (e.g. user edits
target_resolutions), drift recovery derives the new bucket assignment
for each entry purely from the cached aspect ratio (parse "HxW" -> aspect,
run the same argmin against the new bucket_aspects). Any pre-existing
.pt file matching a derived key is reused; missing keys queue rebuilds
of just that variant. No source images are decoded for unchanged
resolutions.
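
The aspect-only rebucketing can be illustrated as below. This is a simplified sketch: the helper names are hypothetical, the bucket list is passed in explicitly as (height, width) pairs, and the real AspectBucketing math may differ.

```python
def parse_resolution_key(key):
    """Parse an "HxW" variant key such as "896x640" into (height, width)."""
    h, w = key.split("x")
    return int(h), int(w)


def closest_bucket(aspect, buckets):
    """Return the (h, w) bucket whose h/w ratio is nearest to `aspect`."""
    return min(buckets, key=lambda hw: abs(hw[0] / hw[1] - aspect))


def rebucket_key(old_key, new_buckets):
    """Derive the new variant key from the cached key alone, no image decode."""
    h, w = parse_resolution_key(old_key)
    new_h, new_w = closest_bucket(h / w, new_buckets)
    return f"{new_h}x{new_w}"
```

If the derived key already has a .pt on disk it can be linked in; otherwise only that variant needs a rebuild.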

The image cache thus becomes a multi-resolution store: training at 512
yesterday and 768 today doesn't invalidate the 512 variants — both
coexist. Wired through DataLoaderText2ImageMixin and
StableDiffusionFineTuneVaeDataLoader via ``bucket_method_provider`` and
``rebucket_provider`` callbacks.

Other changes:
- gc_preview/gc_clean walk every variant and honour the v2->v3 migrator.
- blank_sentinel.pt is now correctly recognised as referenced (latent
  bug present pre-CACHE_VERSION 3 too).
- _validate_entry returns 'missing_variant' for variant-level rebuilds
  that preserve the parent entry and other variants.
- AspectBucketing exposes bucket_for_aspect() and
  compute_bucket_method_hash() for the cache to call without re-entering
  the LoadImage chain.

Previously, drift recovery only fired when the cache.json had a
stored ``bucket_method`` AND it differed from the current one. v2
caches migrated to v3 have ``stored == None``, so drift was skipped
and stale variants kept being served unchanged — even when the user's
target_resolution had changed since the cache was originally built.

This manifested as OOM (latents larger than the trainer expected) and
batch shape-stack errors (mixed-resolution caches grouping inconsistently).

Fix: trigger drift recovery whenever ``stored != current`` (treating
``None`` as "old / unknown"). On the no-change happy path the
recovery is a no-op since aspect math produces the already-cached
variant keys. When keys do differ, existing pre-built variants are
linked in if their .pt files exist on disk, and only entries with no
matching variant trigger rebuilds.

Also bump AspectBucketing's bucket_method version from ``aspect_v1`` to
``aspect_v2`` so users who already validated under v3 (and got an
aspect_v1 hash stamped) re-run drift recovery once to catch any
inconsistencies the original v2 -> v3 migration missed.

Two orthogonal correctness fixes for SmartDiskCache.

schema_keys per variant: each variant now stores the sorted list of
split_names + aggregate_names that were present when its .pt was
written. _validate_entry returns the new 'incomplete_schema' result
when a config change (e.g. enabling masked_training) adds required
keys that the cached .pt doesn't carry, instead of silently letting
sentinel-padded zero tensors leak into training. Stamping happens on
build, dedup, and post-augment so legacy variants get backfilled the
first time they're touched. _ensure_pt_files now reads the existing
.pt to verify schema completeness before reusing it, instead of
re-registering an incomplete file by name match alone.
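
The schema_keys check might be sketched as follows. The function names and the 'unknown' status for unstamped legacy variants are assumptions; how the module actually treats unstamped variants is not spelled out here.

```python
def stamp_schema(variant, split_names, aggregate_names):
    """Record which keys the .pt carried when it was written."""
    variant["schema_keys"] = sorted(set(split_names) | set(aggregate_names))


def check_schema(variant, split_names, aggregate_names):
    """Return 'valid', 'incomplete_schema', or 'unknown' (unstamped legacy)."""
    required = set(split_names) | set(aggregate_names)
    if not required:
        return "valid"  # nothing required: short-circuit
    stored = variant.get("schema_keys")
    if stored is None:
        return "unknown"  # assumption: caller decides how to treat legacy variants
    if not required.issubset(stored):
        return "incomplete_schema"
    return "valid"
```

Enabling a setting that adds a required key (for example latent_mask) would flip a previously valid variant to 'incomplete_schema' under this check.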

Sentinel reshape: _load_blank_sentinel returns tensors templated off
some arbitrary entry's spatial shape. When the per-file pad path
fired for an item with a different aspect (portrait sentinel into a
landscape item, or vice versa), the verbatim copy crashed
torch.stack downstream in AspectBatchSorting. The pad path now
detects spatial-dim mismatch via a reference shape from any
already-loaded item tensor and zeros a correctly-shaped tensor instead.

Sidecar caches sometimes track source files that may legitimately
not exist on disk — a mask cache, for instance, keys on
<image>-masklabel.png and falls back to a synthesised white mask
(GenerateImageLike) when the file is absent. Today those entries
hit OSError on every getmtime/hash call and either rebuild every
run (validation path) or never build at all (build path).

The new flag flips three sites to a "trust the cache entry" stance
when the source file is missing AND the entry already exists:

- _validate_entry: when current_mtime is None, pin it to the stored
  entry mtime instead of returning 'rebuild', so the equality branch
  takes over and runs the existing variant/.pt/schema checks.

- _fast_validate spot check: treat OSError on getmtime as "use the
  stored mtime" rather than aborting the fast path.

- Build callback: when the source can't be stat'd, fall back to
  mtime=0 and an xxhash of the filepath bytes so distinct synthetic
  entries don't dedup together via a shared "no hash" sentinel.

The default stays False so the image cache (and every other
existing instance) keeps its current loud-fail behaviour when a
source file vanishes.

TestValidateEntryVariantStatus builds an SDC via __new__ to exercise
_validate_entry without spinning up a full pipeline. The schema-
completeness check added in the prior commit reads self.split_names
/ self.aggregate_names; the bare fixture didn't set them. Stamp
both as empty lists (and tolerate_missing_source=False) so the
schema check returns early via the empty-required short-circuit.

When a cache's source path differs from the image path (e.g. a mask
cache keyed on -masklabel.png), it can register the same set of
resolution-keyed variants as the image cache but pick a different
active key per item. The image cache reads upstream crop_resolution
to choose its variant; the sidecar falls back to "first variant in
dict", and dict order is set at build time and may not reflect the
image's current bucket choice. The result is shape mismatches at
collate -- the image arrives at one resolution, its mask at another.

resolution_from_upstream flips the sidecar to:
  - skip _drift_recovery_pass entirely (its variant-derived aspect
    can disagree with the image's actual aspect when sidecar files
    have different dimensions)
  - in per-entry validation, always call _get_resolution_string and
    use it for the active key, instead of the cached-key fast path
  - relax _get_resolution_string's gate so it works for caches that
    don't aggregate crop_resolution but still need to read it from
    upstream

Performance is fine for the intended use: the upstream walk hits the
image cache's already-loaded aggregate dict (one map lookup), not a
full LoadImage+CalcAspect chain like the image cache itself would
incur. Default stays False so every existing instance behaves
identically.

The previous commit only patched per-entry validation. The
fast-validate and session-skip paths still pre-set _active_key_by_filepath
to the first dict variant via _populate_active_keys, which masked
the upstream-resolution behaviour whenever the cache was already
fully populated and just needed validation.

Two additions:
  - _populate_active_keys is a no-op under resolution_from_upstream;
    let active keys be set lazily so a stale dict ordering can't pin
    the wrong variant for the session.
  - get_item resolves the active key from upstream crop_resolution
    on first read of each filepath when the flag is set, before
    consulting _active_cache_file.

Together these guarantee the sidecar cache always serves the variant
matching upstream's per-item bucket choice, regardless of which
validation path the cache took at startup.

When a sidecar cache (e.g. mask cache) walks upstream for
crop_resolution, the chain may pass through another SmartDiskCache
(image cache) before reaching AspectBucketing. SmartDiskCache is a
SingleVariationRandomAccessPipelineModule, and the walker enforces
that the requested variation matches the module's current_variation.
The previous in_variation=0 callsite worked when crop_resolution
was upstream of any cache (RandomAccessPipelineModule, no variation
check) but breaks the moment a SingleVariation module sits in the
chain — every item raises, the bare except swallows it, every
variant key falls back to NO_RESOLUTION_KEY, and the cache_file
loses its resolution suffix.

This was hidden in the test environment because new pipelines run
at current_variation=0 from epoch 0; the bug surfaced when the user
resumed training mid-run (current_variation=3 on the image cache,
mask cache passing 0).

Fix: dispatch the upstream call at this module's current_variation.
crop_resolution is variation-invariant in the bucketing pipeline,
so the result is identical, but the variation field now matches
whatever the upstream cache is currently set to. Falls back to the
caller's in_variation when current_variation is unset (-1).

SmartDiskCache now accepts an optional extra_watched_paths_in_names list
of in-name fields whose per-sample resolved paths are fingerprinted and
validated alongside the primary source. Touching, adding, or removing a
watched sidecar (e.g. a -masklabel.png that controls the latent_mask
baked into the bundled .pt) invalidates that entry and forces a rebuild
on the next run.

Per-entry validation mirrors the primary-source mtime -> hash escalation,
so touch-only mtime drift with unchanged bytes does NOT trigger a
rebuild. Each entry stores sidecar_mtimes and sidecar_hashes; the
directory fingerprint includes the sidecar basenames so the fast-path
also notices add/remove/touch.

Cache index format is additive (no CACHE_VERSION bump). Legacy entries
without sidecar_mtimes auto-populate from current disk state on first
read; the fast-path is forced to slow-path once when sidecar watching
is configured but entries pre-date the feature, so the new metadata
gets written.

Default extra_watched_paths_in_names=None makes every existing caller
a no-op.

Tests cover: matching mtime, mtime drift + hash match (refreshes
stored mtime), content change, sidecar added, sidecar removed,
end-to-end fast-path fail on edit, end-to-end fast-path stays valid
on touch-only.

Restore the per-epoch random target selection that the old DiskCache had
for free, without paying to build every variant for every file upfront.
Variants accumulate lazily across epochs as rand.choice rotates each
item's target; in steady state validation is a no-op.

Highlights:

- AspectBucketing: per-call ``_target_override`` so SmartDiskCache can pin
  a specific target during a parallel build batch.
  ``variant_key_from_aspect(variation, index, aspect)`` reproduces the
  rand.choice + bucket math given a known aspect ratio, with no upstream
  image decode.

- SmartDiskCache:
  * Per-epoch active key resolution in ``__refresh_cache`` and ``get_item``
    via ``_get_resolution_string`` walked at the current epoch's variation.
  * ``_fast_resolution_string`` uses the entry's stamped
    ``original_resolution`` for exact aspect; lazy-stamps legacy entries
    on first encounter so subsequent epochs don't decode images.
  * Aggregate cache keyed by ``(filepath, variation, in_index)``. The same
    source file referenced across multiple concepts (or under repeats) is
    queried at distinct in_index values; ``rand.choice`` is seeded on
    ``(variation, index)`` so each in_index resolves to a different
    variant's crop_resolution. The old 2-tuple key let the second
    ``_load_aggregate_cache`` write clobber the first, so AspectBatchSorting's
    sort-time crop_resolution diverged from what split-fetch loaded —
    ``torch.stack`` then failed on mismatched latent shapes.
  * Shallow-copy on ``get_item`` aggregate return so the PipelineModule
    walker's later ``item_cache.update`` can't bleed back into the cache.
  * Aggregate always loaded from ``.pt`` for correctness (synth fast path
    disabled until original_resolution is stamped on every entry).
  * Build pass stamps ``entry['original_resolution']`` from the upstream
    walker's cached CalcAspect output (free).
  * Multi-target builds group queued items by AspectBucketing target int
    and run one parallel batch per target with ``_target_override`` held.
    This stays thread-safe because the override is shared across workers,
    so every worker in a batch sees the same target.
  * Session-skip and fast-validate paths bypassed under multi-target
    bucketing so each epoch's start() runs the lazy-build pass.

- ``_drift_recovery_pass`` no longer pins ``_active_key_by_filepath`` to
  the first new key — that biased every image to the smallest target on
  the next run. Variant-dict reordering kept for the text-cache
  any-variant fallback.

Plumbed via the new ``SmartDiskCache(..., aspect_bucketing=...)`` kwarg
populated from OneTrainer's existing ``_aspect_bucketing_for_cache``.
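
The aggregate-key bug from the SmartDiskCache bullets above can be reproduced in miniature. This is a toy model: ``load_aggregate``, ``get_aggregate``, and the ``resolve`` callback are hypothetical stand-ins for the real cache and the seeded ``rand.choice``.

```python
def load_aggregate(cache, requests, resolve, three_tuple=True):
    """Fill `cache` for each (filepath, variation, in_index) request."""
    for filepath, variation, in_index in requests:
        if three_tuple:
            key = (filepath, variation, in_index)
        else:
            key = (filepath, variation)  # old key: in_index is lost
        cache[key] = {"crop_resolution": resolve(variation, in_index)}
    return cache


def get_aggregate(cache, key):
    """Return a shallow copy so caller-side updates can't leak into the cache."""
    return dict(cache[key])
```

With the 2-tuple key, two requests for the same file at different in_index values share one slot and the later write wins; the 3-tuple key keeps both.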
BitcrushedHeart and others added 11 commits May 15, 2026 22:42

_load_aggregate_cache's fast path synthesized crop_resolution by
parsing (h, w) out of the variant key string and never read the .pt.
get_item's split-fetch loaded the .pt and returned its stored
crop_resolution. The two diverge whenever drift recovery linked an
out-of-grid key to an old .pt, _target_int_for_resolution_key
returned an ambiguous target during a multi-target build pass, or
the AspectBucketing config changed between build and read.

Symptom: AspectBatchSorting at sort time bucketed by the synthesized
value while the .pt at fetch time delivered a different shape.
torch.stack then crashed in collate with two latents of different
H/W in the same batch (despite both items individually agreeing
crop_resolution = latent shape).

Fix: stamp the .pt's actual stored crop_resolution onto each
variant index entry. _try_synthesize_aggregate now reads from that
stamp, so agg cache and split-fetch are sourced from the same field
by construction. _load_aggregate_cache's slow path lazy-stamps when
it loads from disk so pre-existing caches migrate themselves on
first read, then _save_cache_index once at the end persists the
stamps for subsequent epochs. _try_dedup propagates the stamp.

The fast-path I/O win is preserved — synth still skips torch.load,
it just reads a dict field instead of parsing a key string. Pre-
stamp caches pay one torch.load per .pt on the first epoch (the
slow path was always reachable there anyway); subsequent epochs
are dict-lookup-only.

Tests: TestSynthesizeAggregateChecksVariantExists added, including
test_stamped_value_overrides_key_string (canonical divergence
repro) and test_load_aggregate_cache_lazy_stamps_legacy_variant
(migration path). 101/101 pass.
…ssues

Validation passed out_variation (epoch counter) to _fast_resolution_string /
_get_resolution_string while get_item passed in_variation, so AspectBucketing
rolled different bucket keys on each epoch. cache.json accumulated ghost
variants that get_item never asked for. Validation now iterates
`for in_variation in range(variations)` per in_index, queueing builds keyed by
(in_index, in_variation). _source_mtimes is populated before the validation
loop (was conditional on the fast-validate fallback path).
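
The reworked validation loop can be sketched roughly as follows. `plan_builds`, `resolve_key`, and the `cached` mapping are assumed names for illustration; the point is only that validation resolves the bucket key with the same (in_index, in_variation) pair that get_item will later use:

```python
from typing import Callable, Dict, List, Set, Tuple

BuildKey = Tuple[int, int]  # (in_index, in_variation)


def plan_builds(
    num_items: int,
    variations: int,
    resolve_key: Callable[[int, int], str],
    cached: Dict[int, Set[str]],
) -> List[BuildKey]:
    """Queue a build for every (in_index, in_variation) whose key is missing."""
    queue: List[BuildKey] = []
    for in_index in range(num_items):
        have = cached.get(in_index, set())
        for in_variation in range(variations):
            # Same arguments the read path uses, so no ghost variants appear.
            key = resolve_key(in_index, in_variation)
            if key not in have:
                queue.append((in_index, in_variation))
    return queue
```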

Additional fixes:
- Sourceless aggregate cache key shape mismatch — preload wrote (fp, 0)
  while get_item read (fp, 0, in_index); preload now uses the 3-tuple key
  and runs in parallel via _state.executor.
- _compute_watched_fingerprints continues past a single unreadable parent
  instead of returning None and killing the fast path globally.
- _save_cache_index serializes inside the lock, writes outside, so workers
  aren't blocked on disk I/O during large-cache flushes.
- _ensure_blank_sentinel prefers an entry whose variants have schema_keys
  covering the required keys, falling back to any-variant only if no
  schema-complete entry exists.
- gc_preview / gc_clean enumerate .pt variations via itertools.count(1)
  instead of range(1, 100).
- Fast-validate spot-check ceiling raised 50 → 500.
- Targeted exception logging in _get_resolution_string,
  _compute_bucket_method, and _build_cache_entry's torch.load.
- Removed dead _pt_path method (zero callers), dead `in_index is None`
  ternary in _get_extra_paths, dead `if not expected_paths: pass` branch
  in _check_sidecars.
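
As an illustration of the "serialize inside the lock, write outside" item above, a minimal sketch under assumed names (`IndexStore` is hypothetical). The temp-file-plus-`os.replace` step is an addition of this sketch, not something the commit message states:

```python
import json
import os
import tempfile
import threading


class IndexStore:
    def __init__(self, path: str):
        self._path = path
        self._lock = threading.Lock()
        self._index: dict = {}

    def update(self, key: str, value) -> None:
        with self._lock:
            self._index[key] = value

    def save(self) -> None:
        # Snapshot under the lock: serialization is CPU-only.
        with self._lock:
            payload = json.dumps(self._index, sort_keys=True)
        # Disk I/O runs with the lock released, so update() callers
        # are not blocked while the file is written.
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(payload)
            os.replace(tmp_path, self._path)  # atomic swap into place
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
```

The trade-off is that two concurrent save() calls may land in either order; each written file is still a consistent snapshot.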

Tests: 111 passed (was 100 + 8 bug-provers + 3 deferred). Adds
TestSourcelessAggregateCache, TestFingerprintPartialSkip,
TestSaveIndexLockDuration, TestGetExtraPathsNoneCheck,
TestCheckSidecarsEmptyExpected, TestGCVariationCount,
TestSentinelTemplateSelection, TestValidationVariationMatchesGetItem.
Includes a stub AspectBucketing for variation-mismatch regression coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-existing mid-file import was the last ruff complaint after the
SmartDiskCache fixes. Safe to move — it's just a constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Same semantics as the OT_SKIP_CACHE_VALIDATION=1 env var, but per-instance
and config-driven so it follows each run's TrainConfig instead of lingering
process state. Trust runs no longer pop last_validated: they neither verify
nor invalidate source state, so the fast-validation token stays valid for
later non-trust runs.

New optional content_key_in_name ctor param (e.g. the final post-augmentation
'prompt' string): when set, each variation's cached payload is registered in
a persistent content_index.json keyed by hash(modeltype + schema + text).
On rebuild, a content hit is served by copying the donor .pt and refreshing
its __concept_* metadata instead of re-running the encoder. Editing one line
of a multi-line caption re-encodes only that line; re-ordering lines costs
zero encodes; identical lines across files and concepts are encoded once for
the whole dataset.

Safety: only write-time-stamped hashes (__content_hash in the .pt) are ever
registered - a hash inferred for a pre-existing .pt could be wrong if the
seed or pipeline layout changed since it was built, and would poison reuse
for other files. Reuse is gated on resolution-less caches (text), so image
and mask variants can never alias across resolutions.
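
A rough sketch of the content-addressed reuse idea, assuming an in-memory dict in place of content_index.json and donor .pt copies. `ContentIndex` and `content_key` are hypothetical names, and `hashlib.blake2b` stands in for xxhash only so the example stays stdlib-only:

```python
import hashlib
from typing import Callable, Dict, List


def content_key(modeltype: str, schema: str, text: str) -> str:
    # Stand-in digest; the PR uses xxhash, which is not in the stdlib.
    joined = "\x00".join((modeltype, schema, text))
    return hashlib.blake2b(joined.encode("utf-8"), digest_size=8).hexdigest()


class ContentIndex:
    def __init__(self, encode: Callable[[str], List[float]]):
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.encode_calls = 0

    def get(self, modeltype: str, schema: str, text: str) -> List[float]:
        key = content_key(modeltype, schema, text)
        if key not in self._store:
            # Miss: this is the only place the encoder runs.
            self.encode_calls += 1
            self._store[key] = self._encode(text)
        # Hit: hand back a copy, loosely mirroring "copy the donor .pt".
        return list(self._store[key])
```

Because the key depends on the text and not on the file or line position, re-ordering caption lines produces only hits, and editing one line produces exactly one miss.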

Also:
- before_cache_fun is now deferred to the first variation that actually
  needs an upstream encode, so build passes served entirely from dedup or
  content copies never move the encoder onto the GPU.
- New build_max_workers ctor param gives the build pass a dedicated thread
  pool; the shared pipeline executor is sized for training-time loading and
  starves encoder batching during cache builds.
- xxhash added to requirements.txt (runtime dep already in pyproject).
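
The deferred before_cache_fun behaviour from the list above can be approximated with a run-once hook. `DeferredHook` and `build_variation` are hypothetical names, and the sketch only assumes the hook is what moves the encoder onto the GPU:

```python
import threading
from typing import Callable


class DeferredHook:
    """Runs a setup callback at most once, on first demand."""

    def __init__(self, hook: Callable[[], None]):
        self._hook = hook
        self._lock = threading.Lock()
        self._done = False

    def ensure(self) -> None:
        if self._done:
            return
        with self._lock:
            if not self._done:  # re-check after acquiring the lock
                self._hook()
                self._done = True


def build_variation(cache_hit: bool, hook: DeferredHook, encode: Callable[[], str]) -> str:
    if cache_hit:
        return "copied"  # served from dedup or a content copy: no encoder needed
    hook.ensure()        # the first real encode triggers the setup
    return encode()
```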
Two opt-in flags, both default-off (single-request behavior is exactly the
legacy bs=1 forward):

- trim_padding: forward only the real-token prefix and zero re-pad the
  hidden state. With right padding and a causal LM, trailing pad tokens
  cannot influence real positions, so every row a downstream
  PruneMaskedTokens keeps is unchanged. A typical 60-token caption stops
  paying for a 512-token forward.

- batch_collector/max_batch_size: leader-elected collector that gathers
  concurrent get_item calls into one padded batch forward. Forwards are
  inherently serialized (one leader at a time), which also covers the
  transformers check_model_inputs thread-safety bug (transformers#42673)
  structurally. With a layer-offloaded or quantized encoder, N captions now
  share one weight-stream instead of paying it N times.

Equivalence covered by tests with a tiny CPU Qwen3: trimmed and batched
hidden states match the padded bs=1 forward on all real-token rows.
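
To illustrate why trim_padding is safe under right padding with a causal model, here is a toy sketch in pure Python. `causal_encode` is a made-up stand-in for a causal LM (each output depends only on earlier positions), and deriving the effective length from trailing PAD tokens is an assumption of this example; a real encoder would normally use the attention mask:

```python
from typing import List

PAD = 0


def causal_encode(tokens: List[int]) -> List[int]:
    """Toy causal 'encoder': output[i] depends only on tokens[0..i]."""
    out: List[int] = []
    running = 0
    for position, token in enumerate(tokens):
        running = running * 31 + token + position
        out.append(running)
    return out


def encode_trimmed(tokens: List[int]) -> List[int]:
    """Forward only the real-token prefix, then zero re-pad to full length."""
    effective = len(tokens)
    while effective > 0 and tokens[effective - 1] == PAD:
        effective -= 1
    hidden = causal_encode(tokens[:effective])
    return hidden + [0] * (len(tokens) - effective)
```

The real-token rows of `encode_trimmed` equal those of the full padded forward; only the pad rows differ, and those are the rows a downstream masked-token pruning step would drop.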
The local line (trust_cache, content-addressed caption reuse, encoder
trim/batching) has been used for production training and is the stable
implementation. The remote-only commits 4d67533 (per-(in_index, in_variation)
validation rework + test suite rewrite) and 5b4b840 (import hoist) were
never deployed locally and are superseded wholesale by this line (an 'ours'
merge: their content is intentionally not applied). The xxhash requirements
addition from 4d67533 was re-applied separately in 29c0648.
…ty, I/O waste

- get_item/_load_aggregate_cache resolve the per-epoch bucket key with the
  epoch variation, matching validation/build, so per-epoch variant rotation
  actually reaches training instead of serving the epoch-0 bucket forever
- Sourceless aggregate cache keyed (fp, 0, group_index) to match get_item's
  lookup; aggregate requests no longer fall through to per-item torch.load
- _BatchCollector retries requests individually when a batched forward
  fails, so one bad caption no longer poisons its batchmates into
  blank-sentinel zeros
- missing_pt/incomplete_schema drop and rebuild only the broken variant,
  preserving the entry and its other still-valid variants
- gc_clean/gc_preview no longer break at the first missing variation
  suffix; higher-numbered valid .pt files survive a gap
- Batched trim_padding zeroes each item's tail past its own effective
  length, matching _encode_single, so cache contents are batch-independent
- _try_content_reuse verifies the donor's stamped __content_hash before
  copying; stale content_index mappings are refused
- PadMaskedTokens raises instead of silently truncating real tokens when
  the sequence exceeds max_length
- variant_key_from_aspect returns None (slow-path fallback) when the
  override-enable read fails instead of assuming the override is off
- cache.json is reloaded only when its on-disk stat changed; blank
  sentinel and content-reuse donors are memoized; index writes happen
  outside _index_lock

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The index-write refactor in 667f6eb introduced both attributes in __init__,
but the TestSynthesizeAggregateChecksVariantExists stub builds its instance
via __new__ and only carried _index_lock, so the lazy-stamp test crashed in
_save_cache_index.
…Llama encoders

The leader-elected BatchCollector (including the per-item retry on batch
failure and batch-isolation semantics from 667f6eb) moves from
EncodeQwenText into mgds.TextEncoderBatching with an opaque result type, so
encoders with non-tensor outputs can share it.

EncodeMistralText (Flux2's Mistral Small 24B, Ernie) and EncodeLlamaText
(HiDream and HunyuanVideo's Llama 8B, including the all-hidden-states list
mode and crop_start) gain opt-in batch_collector/max_batch_size params.
Collector only — no trim_padding: those pipelines cache the full padded
hidden state, and the padded rows can flow to the model unmasked, so their
values must remain the encoder's own outputs.

Both default off; single-request behavior is exactly the legacy bs=1
forward. Equivalence covered by tests with tiny CPU models, including the
HiDream-style all-layers + crop_start configuration.

Development

Successfully merging this pull request may close these issues.

DiskCache Variations
