Implement configurable cache eviction policies in chunk cache#820

Open
yingDev wants to merge 10 commits into huggingface:main from yingDev:feat-lru-lfu

@yingDev yingDev commented Apr 27, 2026

Summary

This PR adds configurable chunk-cache eviction behavior. The cache now supports random eviction by default and an opt-in lru policy through chunk_cache.eviction_policy / HF_XET_CHUNK_CACHE_EVICTION_POLICY.

Changes

  • Adds a CacheEvictionPolicy type with parsing/display support for random and lru.
  • Adds runtime config support for chunk_cache.eviction_policy, including config round-trip and key-list test coverage.
  • Refactors DiskCache eviction selection so cache state can evict according to the configured policy.
  • Introduces a separate CacheAccessState for LRU bookkeeping, tracking cache hits/inserts and keeping item indices correct when entries are removed with swap_remove.
  • Keeps random eviction lightweight by skipping access-state initialization and access tracking.
  • Preserves CacheManager reuse by cache directory while validating the eviction policy before returning or creating a cache.
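
As a rough illustration of the first two bullets, a policy type with parsing and display support might look like the sketch below. This is not the PR's code: the variant names, the case-insensitive parsing, the error type, and the `policy_from_setting` helper are all assumptions made for the example; only the type name `CacheEvictionPolicy`, the values `random` and `lru`, and `random` being the default come from the summary.

```rust
use std::fmt;
use std::str::FromStr;

/// Eviction policies named in the summary. Variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CacheEvictionPolicy {
    /// Default policy according to the summary.
    #[default]
    Random,
    /// Opt-in policy.
    Lru,
}

impl FromStr for CacheEvictionPolicy {
    type Err = String;

    /// Parses "random" or "lru". Trimming and case folding are assumptions.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "random" => Ok(Self::Random),
            "lru" => Ok(Self::Lru),
            other => Err(format!("unknown eviction policy: {other}")),
        }
    }
}

impl fmt::Display for CacheEvictionPolicy {
    /// Writes the same lowercase token that `from_str` accepts.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            Self::Random => "random",
            Self::Lru => "lru",
        })
    }
}

/// Hypothetical helper: resolve the policy from an optional raw setting,
/// such as the value of HF_XET_CHUNK_CACHE_EVICTION_POLICY if it is set.
pub fn policy_from_setting(raw: Option<&str>) -> Result<CacheEvictionPolicy, String> {
    match raw {
        None => Ok(CacheEvictionPolicy::default()),
        Some(value) => value.parse(),
    }
}
```

Keeping `Display` and `FromStr` symmetric is what makes a config round-trip test straightforward: formatting a value and parsing it back should yield the same variant.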

Testing

  • Adds coverage for access-state initialization behavior.
  • Adds LRU eviction tests to verify recently used entries are retained.
  • Adds LRU index-maintenance coverage for swapped cache entries during eviction.
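
The index-maintenance case is easy to get wrong, so here is a minimal sketch of the problem it covers. It assumes a dense `Vec` of items plus a key-to-index map; `LruIndex` and its methods are hypothetical names, the logical tick stands in for whatever timestamp the real `CacheAccessState` uses, and the linear scan for the victim is a simplification.

```rust
use std::collections::HashMap;

/// Toy stand-in for a cache item list with LRU bookkeeping.
/// All names here are hypothetical, not the PR's types.
#[derive(Default)]
pub struct LruIndex {
    keys: Vec<String>,                // dense item storage
    last_used: Vec<u64>,              // logical access tick, parallel to `keys`
    position: HashMap<String, usize>, // key -> index into `keys`
    tick: u64,
}

impl LruIndex {
    /// Records a hit for an existing key, or inserts a new key.
    pub fn touch_or_insert(&mut self, key: &str) {
        self.tick += 1;
        let existing = self.position.get(key).copied();
        match existing {
            Some(i) => self.last_used[i] = self.tick,
            None => {
                self.position.insert(key.to_string(), self.keys.len());
                self.keys.push(key.to_string());
                self.last_used.push(self.tick);
            }
        }
    }

    /// Removes and returns the least recently used key, if any.
    pub fn evict_lru(&mut self) -> Option<String> {
        // Linear scan for the smallest tick; fine for a sketch.
        let victim = (0..self.keys.len()).min_by_key(|&i| self.last_used[i])?;
        let key = self.keys.swap_remove(victim);
        self.last_used.swap_remove(victim);
        self.position.remove(&key);
        // swap_remove moved the former last element into `victim`,
        // so its stored index must be repointed.
        if let Some(moved) = self.keys.get(victim) {
            self.position.insert(moved.clone(), victim);
        }
        Some(key)
    }

    /// Current index of `key`, if present.
    pub fn index_of(&self, key: &str) -> Option<usize> {
        self.position.get(key).copied()
    }
}
```

Without the repointing step, the entry that `swap_remove` relocated would keep its old index, and a later hit on it would update the wrong slot or index out of bounds.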

Copilot AI review requested due to automatic review settings April 27, 2026 09:47

Copilot AI left a comment

Contributor

Pull request overview

Adds a new runtime configuration knob to control how the on-disk chunk cache evicts entries when over capacity, and wires that policy through cache initialization, eviction selection, and cache instance de-duping.

Changes:

  • Introduce chunk_cache.eviction_policy config (env var HF_XET_CHUNK_CACHE_EVICTION_POLICY) with allowed values random|lru|lfu.
  • Implement LRU/LFU eviction behavior in DiskCache by tracking per-item access stats and selecting eviction candidates accordingly.
  • Update cache manager de-dupe key to include cache capacity and eviction policy, and add tests validating LRU/LFU behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
xet_runtime/src/config/xet_config.rs Extends config roundtrip/all-keys tests to include chunk_cache.eviction_policy.
xet_runtime/src/config/groups/chunk_cache.rs Adds eviction_policy as a validated ConfigEnum setting for the chunk cache group.
xet_client/src/lib.rs Updates crate docs to reflect configurable cache eviction.
xet_client/src/chunk_cache/mod.rs Introduces CacheEvictionPolicy enum and parsing from XetConfig.
xet_client/src/chunk_cache/disk.rs Implements policy-driven eviction plus access-stat tracking; adds LRU/LFU tests.
xet_client/src/chunk_cache/cache_manager.rs Changes cache de-duping key to include directory, size, and eviction policy.
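
To make the de-dupe key described for cache_manager.rs concrete, here is a minimal sketch of a manager that keys cache instances by directory, capacity, and policy. It follows the wording of this review overview only; the PR summary describes the later behavior as reuse by cache directory with policy validation, so the real code may differ. `Manager`, `CacheKey`, `Cache`, and `Policy` are hypothetical names.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// Policies listed in the review overview.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Policy {
    Random,
    Lru,
    Lfu,
}

/// Hypothetical de-dupe key: two requests share a cache only if all fields match.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CacheKey {
    dir: PathBuf,
    capacity_bytes: u64,
    policy: Policy,
}

/// Placeholder for the on-disk cache; no files are touched in this sketch.
#[derive(Debug)]
pub struct Cache {
    pub dir: PathBuf,
    pub capacity_bytes: u64,
    pub policy: Policy,
}

/// Hypothetical manager that hands out shared cache instances.
#[derive(Default)]
pub struct Manager {
    caches: Mutex<HashMap<CacheKey, Arc<Cache>>>,
}

impl Manager {
    /// Returns the existing cache for this exact key, or creates one.
    pub fn get_or_create(&self, dir: &Path, capacity_bytes: u64, policy: Policy) -> Arc<Cache> {
        let key = CacheKey { dir: dir.to_path_buf(), capacity_bytes, policy };
        let mut caches = self.caches.lock().unwrap();
        let entry = caches.entry(key).or_insert_with(|| {
            Arc::new(Cache { dir: dir.to_path_buf(), capacity_bytes, policy })
        });
        Arc::clone(&*entry)
    }
}
```

One consequence of a composite key is that the same directory can end up backing two cache instances with different policies, which is presumably why validating the policy against an existing instance is the alternative design.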

Comment thread xet_client/src/chunk_cache/disk.rs Outdated
Comment thread xet_client/src/chunk_cache/disk.rs Outdated
Comment thread xet_client/src/chunk_cache/disk.rs Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment thread xet_client/src/chunk_cache/disk.rs
Co-authored-by: Copilot <copilot@github.com>
Comment thread xet_client/src/chunk_cache/disk.rs Outdated
Comment thread xet_client/src/chunk_cache/disk.rs
…ion logic

Co-authored-by: Copilot <copilot@github.com>
Comment thread xet_client/src/chunk_cache/cache_manager.rs
Comment thread xet_client/src/chunk_cache/disk.rs

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 972ea5e.

Comment thread xet_client/src/chunk_cache/disk.rs Outdated
xx and others added 2 commits April 28, 2026 21:56
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@rajatarya

Collaborator

@yingDev The chunk cache is disabled by default right now in all clients that use xet-core (git-xet binary, hf-xet Python, hf-xet Rust) - are you using it?

@yingDev

yingDev commented Apr 30, 2026

Author

> @yingDev The chunk cache is disabled by default right now in all clients that use xet-core (git-xet binary, hf-xet Python, hf-xet Rust) - are you using it?

I mainly use xet-core through hf-csi-driver -> hf-mount. Based on my understanding and testing, hf-mount uses the chunk cache by default, with an opt-out via --no-disk-cache; it is attached to unbounded/sequential xet streams, while bounded range reads skip it.

@seanses seanses left a comment

Collaborator

Thanks for contributing!

I have a high-level concern about whether this is truly helpful compared with random eviction:

  1. The LRU state is held only in memory, so it tracks usage only within a single process session, which usually means downloading a single file or repo. In real-world HF repos we don't see high (or even any) data duplication within a single file or a single repo. We have ~20% overall data duplication, but it mostly comes from cross-repo data similarity. This concern may be somewhat alleviated if the feature only targets mounts over git-based repos (expected long-running sessions, checking out files with minimal changes through the git history).
  2. How large does the cache need to be for LRU to beat random eviction? With a rather low cache hit rate (~20%, maybe higher in a very specific scenario), LRU struggles to accumulate enough information when the working set significantly exceeds the cache size, and easily falls back to the same eviction decisions a random policy would make.

Given these concerns, I would want to see benchmarks over real data showing that this yields superior performance, along with disk-usage and memory-usage metrics, before drawing a conclusion.
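
One way to explore the second question before running real benchmarks is a toy replay of an access trace against both policies. The sketch below is only that: a synthetic, in-memory simulation with hypothetical names (`simulate`, `cyclic_trace`, `Policy`) and a hand-rolled xorshift generator so it needs no external crates. It says nothing about real HF data. It does show one known boundary case: on a cyclic scan whose working set exceeds the cache, LRU always evicts the entry that will be needed soonest and gets no hits, while random eviction keeps some.

```rust
use std::collections::HashMap;

/// Tiny deterministic xorshift64 generator (no external crates needed).
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

#[derive(Clone, Copy, PartialEq)]
pub enum Policy {
    Random,
    Lru,
}

/// Replays `trace` against a cache holding `capacity` keys and returns the hit count.
pub fn simulate(trace: &[u32], capacity: usize, policy: Policy, seed: u64) -> usize {
    assert!(capacity > 0, "capacity must be positive");
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut slots: Vec<u32> = Vec::with_capacity(capacity);
    // key -> step of last access; also serves as the membership set.
    let mut last_used: HashMap<u32, usize> = HashMap::new();
    let mut hits = 0;
    for (step, &key) in trace.iter().enumerate() {
        if last_used.contains_key(&key) {
            hits += 1;
        } else {
            if slots.len() == capacity {
                let victim = match policy {
                    Policy::Random => (rng.next() % capacity as u64) as usize,
                    Policy::Lru => (0..slots.len())
                        .min_by_key(|&i| last_used[&slots[i]])
                        .unwrap(),
                };
                let evicted = slots.swap_remove(victim);
                last_used.remove(&evicted);
            }
            slots.push(key);
        }
        last_used.insert(key, step);
    }
    hits
}

/// Builds a trace that scans keys 0..working_set in order, `rounds` times.
pub fn cyclic_trace(working_set: u32, rounds: usize) -> Vec<u32> {
    (0..rounds).flat_map(|_| 0..working_set).collect()
}
```

For example, `simulate(&cyclic_trace(96, 10), 64, Policy::Lru, 42)` returns 0, whereas a trace whose working set fits in the cache gives every access after the first round as a hit under either policy. Real workloads fall somewhere between these extremes, which is why measured traces are still needed.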

@yingDev

yingDev commented May 1, 2026

Author

That’s a fair concern. I agree this should not be presented as generally better than random eviction without real-data benchmarks.

My main target is the hf-mount use case. There, I think the chunk cache mostly behaves like a file cache: when the same file is accessed repeatedly, LRU can help avoid performance variance caused by random eviction.

For short-lived hf-xet/git-xet download sessions, I agree LRU may not have enough local history to outperform random eviction, so keeping random as the default and making LRU opt-in still seems safer.

For multi-process scenarios, the current in-memory LRU state may indeed be insufficient. I’ll try improving the implementation by persisting the LRU access state locally.

@yingDev yingDev marked this pull request as draft May 1, 2026 08:05
@yingDev yingDev marked this pull request as ready for review May 1, 2026 08:05
@yingDev yingDev marked this pull request as draft May 1, 2026 08:05
xx and others added 2 commits May 3, 2026 01:36
…marking

- Introduced `DEFAULT_CHUNK_CACHE_ACCESS_UPDATE_INTERVAL_NS` constant to define the default access update interval for the chunk cache.
- Updated the configuration group to include `access_update_interval_ns` with the ability to set it via the environment variable `HF_XET_CHUNK_CACHE_ACCESS_UPDATE_INTERVAL_NS`.
- Enhanced the `XetConfig` tests to validate the new access update interval configuration.
- Added a new benchmarking tool `chunk_cache_lru_bench.rs` to evaluate the performance of the chunk cache with different eviction policies and access update intervals.
- Implemented `CacheMetadataDb` to manage access timestamps and ensure efficient updates based on the defined access update interval.
- Added tests for access update intervals to verify correct behavior for recording access based on the configured interval.
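
The access update interval can be read as a write throttle: an access is persisted only if enough time has passed since the last persisted access for that key. The sketch below illustrates that reading with an in-memory map in place of the SQLite-backed `CacheMetadataDb`. `AccessRecorder` and its methods are hypothetical, and the exact comparison (strictly less than the interval means skip) is an assumption chosen to match the benchmark labels, where 0 ns records every access and `u64::MAX` ns skips repeated ones.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the access-timestamp store.
pub struct AccessRecorder {
    interval_ns: u64,
    last_recorded_ns: HashMap<String, u64>,
}

impl AccessRecorder {
    /// `interval_ns` plays the role of `access_update_interval_ns`.
    pub fn new(interval_ns: u64) -> Self {
        Self { interval_ns, last_recorded_ns: HashMap::new() }
    }

    /// Returns true if the access at `now_ns` was recorded, i.e. would
    /// cost a metadata write. The first access to a key is always recorded.
    pub fn record_access(&mut self, key: &str, now_ns: u64) -> bool {
        match self.last_recorded_ns.get(key).copied() {
            // Too soon since the last recorded access: skip the write.
            Some(prev) if now_ns.saturating_sub(prev) < self.interval_ns => false,
            _ => {
                self.last_recorded_ns.insert(key.to_string(), now_ns);
                true
            }
        }
    }
}
```

The trade-off is between LRU accuracy and write volume: a larger interval means fewer metadata updates on hot entries, at the cost of coarser recency information.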

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@yingDev yingDev marked this pull request as ready for review May 2, 2026 19:50
@yingDev

yingDev commented May 2, 2026

Author

I think that in hf-mount's use case, the main purpose of the chunk cache is not deduplication but consistent local file I/O performance. So LRU might be a valuable alternative to the random eviction policy.

I have changed the LRU chunk cache implementation to be SQLite-based. My knowledge here is very limited, so this may not be the best approach. If xet-core cannot provide an LRU chunk cache, I may have to switch from hf-mount (hf-csi-driver) to gcsfuse, which according to its documentation uses LRU eviction by default.

$ cargo bench -p xet-client --bench chunk_cache_lru_bench -- --entry-size 4MiB --nproc 2
Chunk cache read/write benchmark

Workload
  Entry size:        4.00 MiB
  Warm items:        128
  Fresh write items: 128
  Read rounds:       5
  Processes:         2
  Cache size:        6.02 GiB
  Eviction cache:    512.00 MiB

LRU access update intervals
  lru/record-access/every 0 ns (record every access)
  lru/record-access/skip u64::MAX ns (skip repeated measured accesses)

Cache roots
  lru/record-access/every /tmp/chunk-cache-lru-record-access-every-bench-X2g9pr
  lru/record-access/skip /tmp/chunk-cache-lru-record-access-skip-bench-qeUnPU
  random                 /tmp/chunk-cache-random-bench-RtFoSy

Results
Case                    Operation                     Ops       Elapsed      Throughput       Ops/sec       Latency
-------------------------------------------------------------------------------------------------------------------
lru/record-access/every  read hit                     1280    693.381 ms   7384.11 MiB/s       1846.03  541.70 us/op
lru/record-access/every  duplicate put                1280       1.269 s   4035.99 MiB/s       1009.00  991.08 us/op
lru/record-access/every  fresh put (no eviction)       256    448.767 ms   2281.81 MiB/s        570.45   1.753 ms/op
lru/record-access/every  fresh put (eviction)          256    399.417 ms   2563.74 MiB/s        640.93   1.560 ms/op
lru/record-access/skip  read hit                     1280    576.228 ms   8885.37 MiB/s       2221.34  450.18 us/op
lru/record-access/skip  duplicate put                1280    995.564 ms   5142.81 MiB/s       1285.70  777.78 us/op
lru/record-access/skip  fresh put (no eviction)       256    662.239 ms   1546.27 MiB/s        386.57   2.587 ms/op
lru/record-access/skip  fresh put (eviction)          256    458.594 ms   2232.91 MiB/s        558.23   1.791 ms/op
random                  read hit                     1280    796.476 ms   6428.32 MiB/s       1607.08  622.25 us/op
random                  duplicate put                1280       1.315 s   3894.39 MiB/s        973.60   1.027 ms/op
random                  fresh put (no eviction)       256    606.339 ms   1688.83 MiB/s        422.21   2.369 ms/op
random                  fresh put (eviction)          256    401.017 ms   2553.51 MiB/s        638.38   1.566 ms/op

lru/record-access/every vs lru/record-access/skip
Operation                     Throughput ratio         Latency ratio
----------------------------------------------------------------------
read hit                                 0.83x                 1.20x
duplicate put                            0.78x                 1.27x
fresh put (no eviction)                  1.48x                 0.68x
fresh put (eviction)                     1.15x                 0.87x
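
For readers checking the comparison table, the ratios appear to be the `lru/record-access/every` value divided by the `lru/record-access/skip` value for the same operation, rounded to two decimals. The helper below is a hypothetical reconstruction of that arithmetic, not code from the benchmark tool.

```rust
/// Ratio of the `every` case to the `skip` case, formatted like the table.
/// Both inputs must use the same unit (MiB/s for throughput, or the same
/// time unit for latency).
pub fn ratio(every: f64, skip: f64) -> String {
    format!("{:.2}x", every / skip)
}
```

With the read-hit row, `ratio(7384.11, 8885.37)` gives `0.83x` for throughput and `ratio(541.70, 450.18)` gives `1.20x` for latency, matching the table. The two ratios are roughly reciprocal because both cases perform the same number of operations on the same entry size.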
