
Integrate SmartDiskCache for hash-based persistent caching #1411

Open
BitcrushedHeart wants to merge 17 commits into Nerogar:master from BitcrushedHeart:SmartCache


BitcrushedHeart commented Apr 6, 2026

Contributor

SmartDiskCache Integration

What This Is

Wires OneTrainer into the new 'SmartDiskCache' module from the companion mgds PR (Nerogar/mgds#49). The cache becomes persistent and content-addressed. It grows over time and only rebuilds what's genuinely stale, rather than wiping and rebuilding every time a file changes.

What Changed

Config

Adds a 'sourceless_training' field to TrainConfig, with a migration (migration_10); it defaults to 'False'. The default for 'clear_cache_before_training' changes to 'False', since SmartCache makes forced rebuilds unnecessary in most cases.
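A migration of this kind usually just fills in the new key when an older config file lacks it. A minimal sketch, assuming configs are migrated as plain dicts; the function name `migrate_to_v10` is illustrative and is not OneTrainer's actual migration code:

```python
def migrate_to_v10(data: dict) -> dict:
    """Hypothetical migration: add the new field without touching user values."""
    migrated = dict(data)  # copy so the caller's dict is not mutated
    # Older configs have no such key; sourceless training stays off by default.
    migrated.setdefault("sourceless_training", False)
    return migrated
```

The sketch deliberately leaves an existing 'clear_cache_before_training' value alone; whether the real migration rewrites saved values is not stated here.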

UI

  • Sourceless Training toggle in the Data tab - trains from cached .pt files without source images/text
  • Clean Cache button in the Data tab - shows a preview of orphaned cache files (count + MB) before deleting anything, and handles both the text and image cache directories
  • Updated clear_cache_before_training tooltip to reflect that SmartCache validates incrementally and detects model type changes automatically
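The preview-then-delete flow behind the Clean Cache button can be sketched as follows. This assumes an orphan is any .pt file whose name is not referenced by the cache index; the function names and the `referenced` set are hypothetical, not the PR's actual code:

```python
from pathlib import Path


def preview_orphans(cache_dir: Path, referenced: set[str]) -> tuple[list[Path], float]:
    """Return orphaned .pt files and their combined size in MB, deleting nothing."""
    orphans = [
        p for p in sorted(cache_dir.rglob("*.pt"))
        if p.name not in referenced
    ]
    total_mb = sum(p.stat().st_size for p in orphans) / (1024 * 1024)
    return orphans, total_mb


def delete_orphans(orphans: list[Path]) -> int:
    """Delete files from an earlier preview; returns how many were processed."""
    for p in orphans:
        p.unlink(missing_ok=True)  # tolerate files that vanished since the preview
    return len(orphans)
```

Because `rglob` walks subdirectories, a single call on the cache root covers both the image and text directories.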

Dataloaders

All dataloaders that previously used DiskCache now use SmartDiskCache through DataLoaderText2ImageMixin._cache_modules(). The mixin passes modeltype, source_path_in_name, and sourceless to the SmartDiskCache constructor.

When 'sourceless_training' and 'latent_caching' are both enabled, '_create_dataset()' short-circuits to '[cache_modules, output_modules]', skipping file enumeration, loading, augmentation, and preparation modules entirely.
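The short-circuit can be pictured with a small sketch. The stage names and the flat list of module names are placeholders for illustration, not the real '_create_dataset()' signature; the text-encoder check mirrors the error described in the commit notes, and raising it regardless of 'latent_caching' is an assumption:

```python
def build_module_list(
    stages: dict[str, list[str]],
    sourceless_training: bool,
    latent_caching: bool,
    train_text_encoder: bool = False,
) -> list[str]:
    """Flatten pipeline stages, skipping every source-dependent stage when allowed."""
    if sourceless_training and train_text_encoder:
        # Re-tokenizing prompts needs the source files, so fail early.
        raise ValueError("text encoder training needs source files; disable sourceless training")
    if sourceless_training and latent_caching:
        # Cached .pt files already hold what the upstream stages would produce.
        selected = ["cache", "output"]
    else:
        selected = ["enumerate", "load", "augment", "prepare", "cache", "output"]
    return [module for stage in selected for module in stages[stage]]
```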

Interruptible Caching

Pressing "Stop Training" during caching now finishes the current file, saves the cache index, and stops gracefully. The next run picks up where it left off.
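A rough sketch of that stop-and-resume behaviour, assuming the cache loop polls a stop callback after each file. 'stop_check_fun' and 'CachingStoppedException' are names from this PR, but the exception class is redefined here so the sketch runs on its own, and `build_cache`, `encode` and `save_index` are hypothetical:

```python
from typing import Callable


class CachingStoppedException(Exception):
    """Stand-in for the exception the trainer catches in its epoch loop."""


def build_cache(
    pending: list[str],
    index: dict[str, str],
    encode: Callable[[str], str],
    stop_check_fun: Callable[[], bool],
    save_index: Callable[[dict[str, str]], None],
) -> None:
    for path in pending:
        if path in index:
            continue  # already cached by an earlier, possibly interrupted, run
        index[path] = encode(path)  # the current file always finishes
        if stop_check_fun():
            save_index(index)  # persist progress before unwinding
            raise CachingStoppedException()
    save_index(index)
```

Calling `build_cache` again with the saved index skips everything that was finished before the stop.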

GenericTrainer

'__clear_cache()' now prints a message explaining that SmartCache makes clearing unnecessary. The wipe logic is preserved (deletes image/, text/, and epoch-* dirs) but the default is off.


Closes #280
Closes #109
Closes #1357

Replaces DiskCache with SmartDiskCache in all dataloaders, adds
sourceless_training config field with UI toggle, adds Clean Cache button
with preview dialog, updates clear_cache_before_training default to
False, and adds xxhash to requirements.

SmartCache validates incrementally and detects model type changes
automatically, so the old warning about disabling cache clearing is
no longer accurate.

SmartDiskCache import was placed after CollectPaths/DecodeVAE instead of
in alphabetical order after SingleAspectCalculation.

Text encoder training requires re-tokenizing prompts from source files,
which are not available in sourceless mode. Raise a clear error at
dataset creation time rather than failing mid-training.

- Fix source_path_in_name: prompt_path -> image_path for text cache
- Add stop_check_fun to SmartDiskCache for interruptible caching
- Catch CachingStoppedException in trainer epoch loop
- Closes Nerogar#109

- Text cache now validates against sample_prompt_path instead of image_path
- Clean button disabled while training is running to prevent concurrent access

Upstream mgds SmartCache added f65c2de 'Add fast validation to skip
expensive per-file cache checks', replacing the 20+ min full stat
walk with a directory-mtime + sampled spot-check path that returns
in under a second on unchanged datasets.
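The idea of a directory-mtime check followed by a sampled spot-check can be sketched like this. It is an illustration of the general technique only; the function name, parameters and sample size are assumptions, and the actual mgds logic may differ:

```python
import os
import random
from typing import Optional


def fast_validate(
    directory: str,
    recorded_dir_mtime_ns: int,
    recorded_file_mtimes_ns: dict[str, int],
    sample_size: int = 32,
    seed: Optional[int] = None,
) -> bool:
    """Return True when the cheap checks pass; False means fall back to the full walk."""
    # Adding, removing or renaming an entry updates the directory's own mtime.
    if os.stat(directory).st_mtime_ns != recorded_dir_mtime_ns:
        return False
    # In-place edits leave the directory mtime alone, so spot-check a few files.
    names = sorted(recorded_file_mtimes_ns)
    rng = random.Random(seed)
    for name in rng.sample(names, min(sample_size, len(names))):
        try:
            current = os.stat(os.path.join(directory, name)).st_mtime_ns
        except FileNotFoundError:
            return False
        if current != recorded_file_mtimes_ns[name]:
            return False
    return True
```

The cost is one stat on the directory plus at most `sample_size` file stats, independent of dataset size; the trade-off is that an in-place edit to a file outside the sample goes unnoticed on that pass.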

Upstream mgds SmartCache now caches validated source filepaths in a
per-process set and short-circuits start-of-epoch validation when every
required path is already in that set. Before, even with the fast-validate
path available, each epoch still re-stat'd the dataset. After, only the
first epoch validates; every epoch after that returns immediately.
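A minimal sketch of a per-process validated set, with hypothetical names. It checks only paths not yet seen, which reduces to an immediate return once every required path is in the set:

```python
from typing import Callable, Iterable

# Lives for the lifetime of the process, so it resets on every new run.
_validated_paths: set[str] = set()


def validate_for_epoch(
    required: Iterable[str],
    validate_one: Callable[[str], bool],
) -> int:
    """Validate only paths not yet seen in this process; return how many were checked."""
    pending = [p for p in required if p not in _validated_paths]
    for path in pending:
        if validate_one(path):
            _validated_paths.add(path)
    return len(pending)
```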

Pulls in the SmartDiskCache change that backfills missing
split/aggregate names (e.g. 'latent_mask') into existing .pt files
when settings like masked_training are toggled, instead of crashing
downstream readers with KeyError. Old caches keep working without
a full rebuild.

Replaces the previous 905efb2 augment-in-place with invalidate-and-
rebuild. The augment path could write latent_mask at a shape that
didn't match the cached latent_image (mask_augmentation modules
added by enabling masked_training change crop_resolution), causing
collate_fn to fail with 'stack expects each tensor to be equal size'
on the first batch. Rebuilding the affected entries fresh produces
all keys in one upstream pass so shapes stay consistent. The new
mgds also auto-detects caches stamped by the prior augment code
(via SCHEMA_METHOD marker) and rebuilds them on the next run.

…augment)

Reverts the pin to mgds 51b3f19 (rebuild-on-schema-drift) which was a
non-starter on big caches -- 100k entries means an unacceptable full
VAE re-encode. Switches to mgds bfb3544 which keeps the augment-in-
place strategy but fixes the shape-mismatch bug at source: per cached
entry, augmented values are forced onto the spatial shape of the
already-cached latent_image (bilinear interpolation when upstream
returns a divergent crop_resolution). Existing caches whose
latent_mask was written shape-divergently by the previous augment
get re-augmented automatically via the bumped SCHEMA_METHOD marker.
dxqb self-requested a review May 10, 2026 17:06
dxqb linked an issue May 15, 2026 that may be closed by this pull request

dxqb commented May 15, 2026

Collaborator

does it close #1357 ?

@BitcrushedHeart

Contributor Author

> does it close #1357 ?

Yes. It hashes the file, checks whether a cache entry for that hash already exists, and skips encoding if it does, so a caption of 'dog' maps to a single .pt file whether it is shared by 1 image or 1,000,000.
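As a rough illustration of that lookup (not the mgds implementation): the content hash becomes the cache file name, so identical captions collapse onto one file. `hashlib.blake2b` stands in for xxhash so the sketch needs only the standard library, and `encode` is a hypothetical callable returning the bytes to store:

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path_for(content: bytes, cache_dir: Path) -> Path:
    # blake2b is a stand-in here; the PR adds xxhash to the requirements.
    digest = hashlib.blake2b(content, digest_size=16).hexdigest()
    return cache_dir / f"{digest}.pt"


def encode_if_missing(caption: str, cache_dir: Path, encode: Callable[[str], bytes]) -> bool:
    """Write a cache file unless identical content was already encoded.

    Returns True when encoding actually ran.
    """
    target = cache_path_for(caption.encode("utf-8"), cache_dir)
    if target.exists():
        return False  # same content, same hash, nothing to do
    target.write_bytes(encode(caption))
    return True
```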

…hing)

Wires the new mgds SmartCache features into the text caching path:

- content_key_in_name='prompt' on the text SmartDiskCache: identical
  caption lines are encoded once and reused across variations, files and
  concepts. Editing one line of a multi-line caption re-encodes only that
  line; re-ordering lines is free; a bulk edit appending the same line to
  every caption file encodes it once for the whole dataset.
- Z-Image: trim_padding + batch_collector on EncodeQwenText and a matching
  text cache build worker pool when latent_caching is enabled. Captions no
  longer pay for a full 512-token forward, and up to 8 captions share one
  forward (one weight-stream under layer offload). OT_TEXT_CACHE_BATCH=1
  restores serial bs=1 encoding.
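The per-line reuse described for 'content_key_in_name' can be sketched by keying each caption line separately. This is only an illustration: the function names are hypothetical, stripping whitespace and ignoring blank lines are assumptions, and `hashlib.blake2b` again stands in for the real hash:

```python
import hashlib


def line_keys(caption_text: str) -> set[str]:
    """One content key per non-empty caption line; order and duplicates do not matter."""
    return {
        hashlib.blake2b(line.strip().encode("utf-8"), digest_size=16).hexdigest()
        for line in caption_text.splitlines()
        if line.strip()
    }


def lines_to_encode(caption_text: str, cached_keys: set[str]) -> set[str]:
    """Keys that still need an encoder forward pass."""
    return line_keys(caption_text) - cached_keys
```

With keys derived from line content alone, re-ordering lines yields an empty set, and editing one line of a caption yields exactly one new key.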
…am, HunyuanVideo)

mgds side: the encoder batch collector moved to mgds.TextEncoderBatching and
EncodeMistralText/EncodeLlamaText gained batch_collector support, with
per-item retry when a batched forward fails so one bad caption cannot poison
its batchmates.
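The per-item retry can be sketched as below. This assumes a hypothetical `encode_batch` callable that raises when any caption in the batch is bad; the name, the list-of-floats return type and the use of `None` for failures are illustrative, not the mgds.TextEncoderBatching API:

```python
from typing import Callable, Optional


def encode_with_retry(
    captions: list[str],
    encode_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 8,
) -> list[Optional[list[float]]]:
    """Encode in batches; on failure, retry each caption alone so it fails alone."""
    results: list[Optional[list[float]]] = []
    for start in range(0, len(captions), batch_size):
        batch = captions[start:start + batch_size]
        try:
            results.extend(encode_batch(batch))
        except Exception:
            # One bad caption should not cost its batchmates their result.
            for caption in batch:
                try:
                    results.extend(encode_batch([caption]))
                except Exception:
                    results.append(None)
    return results
```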

OneTrainer wiring (text_encode_batch_size shared in the mixin, env knob
OT_TEXT_CACHE_BATCH, everything gated on latent_caching):

- Qwen-Image: trim_padding + batch collector + build workers. Same proof as
  Z-Image - the pipeline prunes masked hidden-state rows before caching, and
  crop_start composes with trim (head slice vs tail skip).
- Flux2 (24B Mistral dev / Qwen3 klein), HiDream (8B Llama), HunyuanVideo
  (Llama): batch collector + build workers, no trim - these pipelines cache
  full padded hidden states, so padded rows must remain encoder outputs.
- HiDream/HunyuanVideo CLIP and T5 encoders get apply_thread_safe_forward so
  the widened build pool can drive them from multiple threads
  (transformers#42673); their Llama forwards serialize through the collector.
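One common way to make a forward call safe to drive from several threads is to serialize it behind a lock. The sketch below shows that general pattern only; it is not the implementation of 'apply_thread_safe_forward', and `with_forward_lock` is a hypothetical name:

```python
import threading
from typing import Any, Callable


def with_forward_lock(forward: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a callable so only one thread runs it at a time."""
    lock = threading.Lock()

    def locked_forward(*args: Any, **kwargs: Any) -> Any:
        with lock:
            return forward(*args, **kwargs)

    return locked_forward
```

Worker threads can still overlap their file I/O and preprocessing; only the wrapped call itself is serialized.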