Implement configurable cache eviction policies in chunk cache#820
Implement configurable cache eviction policies in chunk cache#820yingDev wants to merge 10 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new runtime configuration knob to control how the on-disk chunk cache evicts entries when over capacity, and wires that policy through cache initialization, eviction selection, and cache instance de-duping.
Changes:
- Introduce
chunk_cache.eviction_policyconfig (env varHF_XET_CHUNK_CACHE_EVICTION_POLICY) with allowed valuesrandom|lru|lfu. - Implement LRU/LFU eviction behavior in
DiskCacheby tracking per-item access stats and selecting eviction candidates accordingly. - Update cache manager de-dupe key to include cache capacity and eviction policy, and add tests validating LRU/LFU behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| xet_runtime/src/config/xet_config.rs | Extends config roundtrip/all-keys tests to include chunk_cache.eviction_policy. |
| xet_runtime/src/config/groups/chunk_cache.rs | Adds eviction_policy as a validated ConfigEnum setting for the chunk cache group. |
| xet_client/src/lib.rs | Updates crate docs to reflect configurable cache eviction. |
| xet_client/src/chunk_cache/mod.rs | Introduces CacheEvictionPolicy enum and parsing from XetConfig. |
| xet_client/src/chunk_cache/disk.rs | Implements policy-driven eviction plus access-stat tracking; adds LRU/LFU tests. |
| xet_client/src/chunk_cache/cache_manager.rs | Changes cache de-duping key to include directory, size, and eviction policy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>
…ion logic Co-authored-by: Copilot <copilot@github.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 972ea5e. Configure here.
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
|
@yingDev The chunk cache is disabled by default right now in all clients that use xet-core (git-xet binary, hf-xet Python, hf-xet Rust) - are you using it? |
I mainly use xet-core through hf-csi-driver -> hf-mount. Based on my understanding and testing, hf-mount uses the chunk cache by default, with an opt-out via |
seanses
left a comment
There was a problem hiding this comment.
Thanks for contributing!
I have a very high level concern that if this is truly helpful compared against random eviction:
- This only holds the LRU state in-memory so it's only tracking the usage status within one process session. This usually means downloading a single file or repo. In real world (HF repos) we don't see high (or even any) data duplication within a single file or a single repo. We have ~20% overall data duplication but mostly come from cross repo data similarity. This concern may alleviate a bit if this feature only targets mount over git-based repos (expected long running session, checking out files with minimal changes through the git history).
- How large does the cache need to be for LRU to be better than random eviction? With a rather low cache hit rate (~20%, maybe higher if targeting a very specific scenario), LRU suffers from accumulating enough information if the working set significantly exceeds cache size, and easily falls back to the same eviction decision as a random policy would make.
Given the above concern, I would want to see benchmarks over real data that this yields superior performance, along with disk usage & memory usage metrics to draw a conclusion.
|
That’s a fair concern. I agree this should not be presented as generally better than random eviction without real-data benchmarks. My main target is the hf-mount use case. There, I think the chunk cache mostly behaves like a file cache: when the same file is accessed repeatedly, LRU can help avoid performance variance caused by random eviction. For short-lived hf-xet/git-xet download sessions, I agree LRU may not have enough local history to outperform random eviction, so keeping For multi-process scenarios, the current in-memory LRU state may indeed be insufficient. I’ll try improving the implementation by persisting the LRU access state locally. |
…marking - Introduced `DEFAULT_CHUNK_CACHE_ACCESS_UPDATE_INTERVAL_NS` constant to define the default access update interval for the chunk cache. - Updated the configuration group to include `access_update_interval_ns` with the ability to set it via the environment variable `HF_XET_CHUNK_CACHE_ACCESS_UPDATE_INTERVAL_NS`. - Enhanced the `XetConfig` tests to validate the new access update interval configuration. - Added a new benchmarking tool `chunk_cache_lru_bench.rs` to evaluate the performance of the chunk cache with different eviction policies and access update intervals. - Implemented `CacheMetadataDb` to manage access timestamps and ensure efficient updates based on the defined access update interval. - Added tests for access update intervals to verify correct behavior for recording access based on the configured interval. Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
|
I think that in hf-mount's use case, the main purpose of the chunk cache is not deduplication, but to ensure consistent local file i/o performance. So LRU might be a valuable alternative to the random eviction policy. I have changed the LRU chunk cache impl to be SQLite-based. My knowledge here is very limited, so this may not be the best approach. If xet-core cannot provide an LRU chunk cache, I may have to switch from hf-mount (hf-csi-driver) to gcsfuse, which according to the documentation uses LRU eviction by default. $ cargo bench -p xet-client --bench chunk_cache_lru_bench -- --entry-size 4MiB --nproc 2 |

Summary
This PR adds configurable chunk-cache eviction behavior. The cache now supports
randomeviction by default and an opt-inlrupolicy throughchunk_cache.eviction_policy/HF_XET_CHUNK_CACHE_EVICTION_POLICY.Changes
CacheEvictionPolicytype with parsing/display support forrandomandlru.chunk_cache.eviction_policy, including config round-trip and key-list test coverage.DiskCacheeviction selection so cache state can evict according to the configured policy.CacheAccessStatefor LRU bookkeeping, tracking cache hits/inserts and keeping item indices correct when entries are removed withswap_remove.CacheManagerreuse by cache directory while validating the eviction policy before returning or creating a cache.Testing