feat(wasm): make hf-xet compile for wasm32-unknown-unknown by assafvayner · Pull Request #841 · huggingface/xet-core

assafvayner · 2026-05-12T21:11:29Z

Closes #840

Summary

Makes hf-xet (xet_pkg) and its streaming download + upload data-prep paths compile and run on wasm32-unknown-unknown, and reworks wasm/hf_xet_wasm/ into a single unified example/smoke-test JS-binding crate (cdylib) that exposes both the download and upload flows from one wasm module. The earlier split into separate hf_xet_wasm_download/ and hf_xet_wasm_upload/ crates was reverted (see ff0c46b8) — one wasm module is enough for the CI smokes and easier for the example pages to share.

The existing thin published crate wasm/hf_xet_thin_wasm/ is unchanged in scope but follows the new shared wasm-bindgen pinning.

Two layers of changes

1. Compile compatibility (commits 12274c0 → 7e3c692): cfg-gated tokio dep split, task_runtime.rs WASM bridge variants, xet_pkg::legacy and _blocking/path-based methods gated to non-wasm, cache gating in xet_client.

2. Streaming runtime compatibility (commits 5b7c26e → 28b9265): make the entire streaming download path !Send-tolerant so it actually executes on WASM (where reqwest's futures are !Send).

A follow-up simplification pass (65ab8aba) deduplicates the wasm/native split where the bodies were identical: TaskRuntime::bridge_async{,_finalizing} are written once via a MaybeSend shim, XetSessionBuilder::build and TranslatorConfig::new are single fns with a cfg'd binding, the wasm shard dedup cache is a plain bounded FIFO, and the ci-smoke suite runs from one shared harness.
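The MaybeSend shim mentioned above can be sketched roughly as follows. This is a minimal illustration of the pattern (a Send supertrait on native, no bound on wasm), not the code in task_runtime.rs; the helper `accept_task` is a hypothetical name used only to show a function body written once for both targets.

```rust
use std::future::Future;

/// On native targets, `MaybeSend` requires `Send`.
#[cfg(not(target_family = "wasm"))]
pub trait MaybeSend: Send {}
#[cfg(not(target_family = "wasm"))]
impl<T: Send> MaybeSend for T {}

/// On wasm the trait carries no bound, so `!Send` futures are accepted.
#[cfg(target_family = "wasm")]
pub trait MaybeSend {}
#[cfg(target_family = "wasm")]
impl<T> MaybeSend for T {}

/// Hypothetical helper: one signature that compiles on both targets.
/// On native the future must be `Send`; on wasm any future is accepted.
pub fn accept_task<F>(fut: F) -> F
where
    F: Future + MaybeSend + 'static,
{
    fut
}
```

The blanket impls mean call sites never implement the trait by hand; only the bound on the generic function changes between targets.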

Key changes (runtime compatibility refactor)

tokio_with_wasm dependency: added to [workspace.dependencies] and to the wasm-target deps of xet_pkg, xet_data, xet_runtime, and xet_client. On native, call sites use real tokio; on wasm, the import `use tokio_with_wasm::alias as tokio` redirects spawn/JoinSet/select!/sync to the wasm-bridged variants, which use wasm_bindgen_futures::spawn_local internally, so spawned futures don't need Send.
Conditional ?Send async_trait: URLProvider (xet_client), DataWriter (xet_data), XorbURLProvider, SequentialWriter, UnorderedWriter, UploadSessionDataManager impls. Trait bounds on dyn T stay Send + Sync; only the futures lose Send on wasm.
tokio_with_wasm::alias as tokio shims: every spawn site in the streaming path (download_stream.rs, unordered_download_stream.rs, sequential_writer.rs, unordered_writer.rs, file_term.rs, manager.rs, xet_pkg/upload_*, xet_data/file_upload_session, etc.) gets the wasm-only import.
Internal gating: SyncWriterThread, SequentialWriter::new (disk), DownloadStream::blocking_next, UnorderedDownloadStream::blocking_next, the chunk-cache tokio::spawn in xorb_block.rs, every _blocking method on session types, XetSession::sigint_abort, XetSessionBuilder::with_tokio_handle, and the entire xet_pkg::legacy module are all #[cfg(not(target_family = "wasm"))].
Shared task bridging: TaskRuntime::bridge_async and bridge_async_finalizing (the task-state machine) exist once for both targets via a MaybeSend shim trait (Send supertrait on native, unconstrained on wasm); only run_inner_async (XetRuntime offload vs inline select) and the native-only bridge_sync* pair are cfg-gated.
xet_runtime adjustments: un-gated ClosureGuard (pure Rust). Moved tracing-appender to non-wasm deps and gated the init submodule. XetRuntime::new has a wasm stub (no tokio runtime created — bridge variants in task_runtime.rs .await directly via wasm_bindgen_futures). XetRuntime::spawn_blocking cfg-gates to tokio_with_wasm::task::spawn_blocking on wasm. XetRuntime::{handle, num_worker_threads, spawn, bridge_async, bridge_sync, external_run_async_task} gated to native — wasm callers route through task_runtime.rs bridge variants instead, and gating surfaces this as a compile-time error rather than a runtime panic. Dropped the wasm SystemMonitor and gated the sysinfo-based native one off on wasm. The system_monitor config group keeps its pre-existing native-only gating (absent from the config on wasm).
Filesystem methods gated: FileReconstructor::{reconstruct_to_file, reconstruct_to_writer}, FileDownloadSession::{download_file, download_file_with_id, download_file_background, download_to_writer}, FileUploadSession::{upload_files, spawn_upload_from_path, feed_file_to_cleaner}, XetUploadCommit::upload_from_path.
xet_data::processing and xet_data::file_reconstruction un-gated. FileDownloadSession, FileUploadSession, XetFileInfo, Sha256Policy, DownloadStream, UnorderedDownloadStream, ChunkCache, CacheConfig, create_remote_client are now available on wasm. TranslatorConfig::new is a single fn that skips filesystem directory creation on wasm. New shard_interface/wasm.rs provides an in-memory MDBInMemoryShard-backed SessionShardInterface with a bounded FIFO dedup cache (no disk staging, no resume).
xet_pkg cleanup: reverted the over-aggressive gates from earlier in this PR. XetDownloadStreamGroup, XetDownloadStream, XetSession::new_download_stream_group, XetSession::new_upload_commit, XetUploadCommit::{upload_bytes, upload_stream, commit, abort}, XetStreamUpload, XetFileUpload, XetFileInfo, Sha256Policy, and FileReconstructionError are all on wasm.
wasm/hf_xet_wasm/ (unified example crate, cdylib): wraps xet::xet_session::XetSession with #[wasm_bindgen] and exposes both the upload and download flows from one wasm module. The JS surface mirrors the Rust builder pattern — XetSession is auth-free; auth lives on the per-commit / per-group builder.
- Download: new XetSession() + XetSession.newDownloadStreamGroup(endpoint, token, tokenExpiry) -> Promise<XetDownloadStreamGroup> + XetDownloadStreamGroup.downloadStream(fileInfo, byteRangeStart?, byteRangeEnd?) -> Promise<XetDownloadStream> + XetDownloadStream.next() -> Promise<Uint8Array | undefined> + cancel().
- Upload: XetSession.newUploadCommit(endpoint, token, tokenExpiry) -> Promise<XetUploadCommit> + XetUploadCommit.{uploadBytes, uploadStream, commit, abort} + XetStreamUpload.{write, finish, abort}.
- build_wasm.sh + examples/download.html + examples/upload.html browser test pages. This is not a published browser SDK — real consumers should depend on hf-xet directly with their own #[wasm_bindgen] glue, or use a downstream SDK such as hf-hub.
CI: build step in .github/actions/build-wasm/action.yml, cargo +nightly check --target wasm32-unknown-unknown -p hf-xet compile gate, Cargo.lock freshness checks, plus a 12-scenario headless-Chromium Playwright smoke matrix in wasm/ci-smoke/ (see Testing). The smokes run from a single shared harness: node run.mjs <scenario> drives one harness.html that dynamically imports the per-scenario module from wasm/ci-smoke/scenarios/, with shared token/assert helpers in common.mjs — each scenario file holds only its scenario-specific logic.

WASM API surface (downloads)

XetSessionBuilder::build() → XetSession
XetSession::new_download_stream_group() → XetDownloadStreamGroupBuilder
  .with_endpoint(...)
  .with_token_info(token, expiry)
  .with_token_refresh_url(...)
  .build().await → XetDownloadStreamGroup
XetDownloadStreamGroup::download_stream(file_info, range).await → XetDownloadStream
XetDownloadStream::next().await → Option<Bytes>
XetDownloadStream::cancel()
XetSession::abort() / id() / config()

WASM API surface (uploads)

XetSessionBuilder::build() → XetSession
XetSession::new_upload_commit() → XetUploadCommitBuilder
  .with_endpoint(...)
  .with_token_info(token, expiry)
  .build().await → XetUploadCommit
XetUploadCommit::upload_bytes(bytes, sha256, name).await → XetFileUpload
XetUploadCommit::upload_stream(name, sha256).await → XetStreamUpload
XetStreamUpload::write(chunk).await / finish().await / abort()
XetUploadCommit::commit().await → XetCommitReport
XetUploadCommit::abort()

The unified hf_xet_wasm crate exposes the JS classes XetSession, XetDownloadStreamGroup, XetDownloadStream, XetUploadCommit, XetStreamUpload, and XetFileUpload from a single wasm module.

Testing

Native: all hf-xet (95 unit + 6 doctest) and xet-data (283 unit) tests pass
WASM compile: cargo +nightly check --target wasm32-unknown-unknown -p hf-xet passes (CI)
WASM build: wasm/hf_xet_wasm/build_wasm.sh and wasm/hf_xet_thin_wasm/build_wasm.sh produce pkg/{js,d.ts,wasm} (CI)
WASM browser smokes (CI, headless Chromium + Playwright, consolidated in wasm/ci-smoke/, each invoked as node run.mjs <scenario>):
- invalid-inputs — local-only wasm-side input validation (blocking, no Hub or CAS).
- download — single-file download from prod hub + CAS, asserts byte count + SHA-256 (non-blocking, hub blips don't fail PRs).
- download-multi — concurrent multi-file download in one XetDownloadStreamGroup (non-blocking).
- upload — uploadBytes + commit() (xorb + shard push to CAS) — regression guard for the wasm XetRuntime::spawn_blocking panic (non-blocking).
- upload-stream — streaming variant of upload (uploadStream + XetStreamUpload::{write, finish}) (non-blocking).
- upload-multi — concurrent multi-file uploadBytes in one commit (non-blocking).
- upload-stream-multi — concurrent multi-file uploadStream (non-blocking).
- upload-mixed — uploadBytes + uploadStream concurrently in one commit (non-blocking).
- upload-tiny — 0-byte / 1-byte / 64 KiB files in one commit; catches regressions to the empty-xorb / no-chunks path (non-blocking).
- upload-multi-commit — two sequential XetUploadCommits from one XetSession; catches XetSession-level resource leaks (non-blocking).
- dedup — session-level in-commit dedup: two identical 65 MiB files trigger a cross-xorb dedup hit (non-blocking).
- global-dedup — downloads the pre-seeded deterministic 16 MiB file from the test repo and re-uploads it; asserts the HMAC-keyed global-dedup shard lookup fully dedups the re-upload (xorb_bytes_uploaded === 0, no new xorb) (non-blocking).
Manual browser tests: serve wasm/hf_xet_wasm/examples/download.html or wasm/hf_xet_wasm/examples/upload.html with a static server (the smoke server.mjs sets the required COOP/COEP headers).

Note

High Risk
Touches core async/I/O paths for a new target and runs prod Hub/CAS smokes; regressions affect browser consumers but the wasm job does not block merges (continue-on-error).

Overview
Adds wasm32-unknown-unknown support for the core hf-xet stack via workspace deps (tokio_with_wasm, pinned wasm-bindgen 0.2.121), lockfile updates, and documented wasm-only API/runtime patterns in README and api_changes/update_260515_wasm_target_support.md.

CI / build: .github/actions/build-wasm now caches wasm tool binaries (keyed on nightly rustc), bumps wasm-bindgen-cli / wasm-pack, runs cargo +nightly check -p hf-xet for wasm, and reorders rust cache paths. The Build WASM job is continue-on-error: true but runs a large Playwright matrix under wasm/ci-smoke/ (shared harness, many upload/download/dedup/lifecycle scenarios against prod Hub/CAS when tokens are set).

Example crates: wasm/hf_xet_wasm becomes a thin cdylib JS wrapper around hf-xet (not a published SDK); hf_xet_thin_wasm aligns bindgen versions and cdylib-only output. .gitignore drops checked-in VS Code settings and ignores editor/node_modules trees.

Reviewed by Cursor Bugbot for commit 585d975. Bugbot is set up for automated code reviews on this repo.

…_pkg

…sionBuilder

Gate imports and items that depend on non-WASM modules (xet_data::processing, xet_data::file_reconstruction, xet_runtime::core::xet_cache_root) behind #[cfg(not(target_family = "wasm"))]:
- error.rs: gate FileReconstructionError import, from_file_reconstruction_error_ref, DataError::FileReconstructionError match arm, and From<FileReconstructionError> impl
- lib.rs: gate init_logging (uses xet_cache_root which is non-WASM)
- legacy/mod.rs: gate data_client, progress_tracking mods and all re-exports from xet_data::processing
- xet_session/mod.rs: gate common, download_stream_group, download_stream_handle mods and their re-exports, and xet_data::processing re-exports
- xet_session/session.rs: gate download_stream_group imports, active_download_stream_groups field, new_download_stream_group, register_download_stream_group, and abort's stream group cleanup

…d impls

…download path

…xt, cache-write on non-WASM

…leDownloadSession

…e ClosureGuard on WASM

…ports WASM

The example now hardcodes a Qwen/Qwen-Image-Edit file (overridable in the form) and uses two HF Hub REST endpoints to derive the XetSession inputs:
- POST /api/{repo_type}s/{namespace}/{repo}/paths-info/{rev} -> xetHash + size
- GET /api/{repo_type}s/{namespace}/{repo}/xet-read-token/{rev} -> { accessToken, exp, casUrl }
These are the documented Xet protocol endpoints from https://huggingface.co/docs/xet/file-id and /docs/xet/auth. paths-info is used instead of the /resolve route because resolve returns a 302 we MUST NOT follow and the X-Xet-Hash header is hard to read in a browser fetch on a redirect.

Headless browser test (puppeteer) of wasm/hf_xet_wasm_download against a real HF CAS endpoint surfaced five runtime panics; all are fixed here.
1. XetSessionBuilder::build on wasm called tokio::runtime::Handle::current, which panics when invoked from a JS callback with no enclosing runtime. Switch to XetContext::with_config; the wasm path of XetRuntime::new now constructs a stub XetRuntime (the wasm bridge variants .await directly via wasm_bindgen_futures, so the inner tokio runtime is never driven).
2. xet_runtime::core::runtime called std::process::id() at multiple sites for fork detection; wasm32-unknown-unknown's libstd panics there. Add a current_pid() helper that returns 0 on wasm.
3. xet_core_structures::xorb_object::compression_scheme used std::time::Instant for compression timing telemetry. On the LZ4 decompression path this panics on wasm. Use web_time::Instant, which aliases std on native and exposes performance.now() on wasm.
4. ExpWeightedMovingAvg, speed_tracker, and the adaptive_concurrency controller used tokio::time::Instant. On wasm32 that falls through to std::time::Instant::now() and panics. Use tokio::time::Instant on native (so the existing tests' tokio::time::advance simulation still works) and web_time::Instant on wasm via cfg.
5. xet_client adaptive_concurrency controller spawned a partial-completion reporter task via tokio::spawn, which requires being inside a tokio runtime context. Switch to the wasmtokio alias so it uses wasm_bindgen_futures::spawn_local on wasm.
Verified end-to-end with a puppeteer test that downloads an 11 MB Xet-backed file from the Qwen/Qwen-Image-Edit repo via the WASM bindings; bytes streamed match the expected size exactly.
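The current_pid() helper described in the commit message above could look roughly like this. It is a sketch based only on that description (returns 0 on wasm, the real pid elsewhere); `forked_since` is a hypothetical caller added to show how a fork check might consume it.

```rust
/// Returns the OS process id on native targets and 0 on wasm,
/// where the commit message reports that `std::process::id()` panics.
pub fn current_pid() -> u32 {
    #[cfg(not(target_family = "wasm"))]
    {
        std::process::id()
    }
    #[cfg(target_family = "wasm")]
    {
        0
    }
}

/// Hypothetical fork check: compares a pid recorded earlier with the
/// current one. On wasm both sides are 0, so this never reports a fork.
pub fn forked_since(recorded_pid: u32) -> bool {
    recorded_pid != current_pid()
}
```

Keeping the cfg split inside one helper means the call sites stay identical across targets.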

…README

- Delete docs/design/2026-05-12-xet-pkg-wasm-download-{design,plan}.md now that the feature is implemented.
- Add a short "WebAssembly compatibility" section to the root README listing the patterns this codebase relies on (web_time::Instant, tokio_with_wasm::alias, conditional ?Send async-traits, filesystem gating) so future contributors and AI agents touching xet_pkg / xet_client / xet_data / xet_core_structures / xet_runtime know not to regress the wasm build.
- Add a README to wasm/hf_xet_wasm_download describing the JS API, the HF Hub endpoints the example calls, and how to do a local browser test, plus a maintainer pointer to the root README.

Pin updates across all three wasm crates, build scripts, READMEs, and the build-wasm GitHub Action. Lockfiles for hf_xet_wasm, hf_xet_thin_wasm, and hf_xet_wasm_download regenerated; js-sys, web-sys, wasm-bindgen-futures, and wasm-bindgen-test pick up matching versions.

Runs the built wasm against prod hub + CAS in headless Chromium and asserts byte count + SHA-256 of a pinned reference file from xet-team/xet-spec-reference-files. Skipped on forks (no token); continue-on-error so a hub blip doesn't block PRs.

The two browser-facing crates are CI smoke wrappers, not published SDKs; say so consistently across both READMEs, both crate-level `//!` docs, the root README, and the api_changes doc. Real browser consumers should depend on `hf-xet` directly with their own `#[wasm_bindgen]` glue. The `hf_xet_wasm_download` README still documented the pre-builder API (`new XetSession(endpoint, token, exp)` → `downloadStream(...)`); rewrite it to match the actual session / group split implemented in the PR. In `xet_pkg/src/lib.rs`, the top-level doctest uses `_blocking` / `upload_from_path` / `download_file_to_path` unconditionally — all non-wasm-only. Add a "WebAssembly targets" section pointing wasm consumers at the async entry points (`new_upload_commit` + `upload_bytes`/`upload_stream`, `new_download_stream_group` + `download_stream`) and at the example crates, and reference the api_changes doc for the full set of wasm-only differences. Use plain text rather than intra-doc links for `legacy` and `new_file_download_group` so wasm-target rustdoc builds don't emit unresolved-link warnings for items that are cfg'd out on wasm.

Both example crates' `validate_session_inputs` documented `0` as "no expiry", but the inner `AuthConfig::maybe_new` only treats *missing* expiry as no-expiry (defaults to `u64::MAX`); an explicit `Some(0)` is preserved as-is, after which `TokenProvider::is_expired()` always returns true (`0 <= cur_time + REFRESH_BUFFER_SEC`) and the next request fails with an auth error because this wrapper does not wire a token refresher. Map `0` to `u64::MAX` at the JS boundary so the documented sentinel actually works for placeholder / local-only flows (the CI upload smoke uses `0` because it never makes a CAS round-trip). Reword the doc comments and READMEs to spell out that any *non-zero* value at or before "now" still fails — the only safe inputs in production are real `exp` values from the Hub `xet-{read,write}-token` response.
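The expiry reasoning in this commit message can be checked with a small sketch. The comparison mirrors the one quoted above (`0 <= cur_time + REFRESH_BUFFER_SEC`); the value of REFRESH_BUFFER_SEC and the function names are assumptions for illustration, not the real TokenProvider API.

```rust
/// Assumed value for illustration only; the real constant lives in the client code.
const REFRESH_BUFFER_SEC: u64 = 30;

/// Mirrors the quoted comparison: a token counts as expired once
/// `expiry <= cur_time + REFRESH_BUFFER_SEC`.
pub fn is_expired(expiry: u64, cur_time: u64) -> bool {
    expiry <= cur_time.saturating_add(REFRESH_BUFFER_SEC)
}

/// Hypothetical boundary mapping: the documented `0` sentinel becomes
/// "never expires"; every other value passes through unchanged.
pub fn normalize_expiry(token_expiry: u64) -> u64 {
    if token_expiry == 0 {
        u64::MAX
    } else {
        token_expiry
    }
}
```

With this mapping an explicit 0 no longer trips the expiry check, while a non-zero value at or before "now" still does. A later commit in this PR replaces the mapping with outright rejection of tokenExpiry <= 0.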

…o hf_xet_wasm Merge the two example/smoke-test crates into a single wasm/hf_xet_wasm/ crate exposing one XetSession with both newUploadCommit and newDownloadStreamGroup. The shared helpers (js_err, validate_session_inputs) now live in src/common.rs instead of being copy-pasted, and the download files are renamed to download_group.rs / download_stream.rs for clarity in the combined module tree. CI (build-wasm action, cache action, Cargo.lock freshness check), the ci-smoke pages, the api_changes doc, the root README, and the xet_pkg doc-comment are updated to reference the single crate. Build script, wasm-bindgen pin, and the JS surface are otherwise unchanged.

Drop the patch component from version requirements that weren't using ^ or = explicitly. The lock file resolves a specific patch anyway, so trimming to major.minor lets the lock regenerate against whatever the local registry happens to have without breaking the version spec. Also regenerates wasm/hf_xet_wasm/Cargo.lock against the slightly older mirror set (ctor 1.0.5, tower-http 0.6.10) so ./build_wasm.sh works on machines using the internal HF crates mirror.

The wasm SystemMonitor only surfaced what the browser exposes via `navigator` / `performance` and was never wired up to anything useful, so remove it rather than maintain a second implementation. On wasm, `SystemMonitor` is no longer compiled and `XetRuntime` does not hold or start one. The `system_monitor` config group still compiles on all targets so config keys round-trip; the runtime simply does not read it on wasm.

Per review feedback: collapse the two cfg-gated `let request = ...` bindings into one expression so the native and wasm alternatives are visually paired.

…cache Wasm previously stubbed `query_dedup_shard_by_chunk` to `Ok(false)` because the native path imports fetched dedup shards into a disk-backed `ShardFileManager`, which is incompatible with the wasm32-unknown-unknown sandbox. The CAS-side query itself was already wasm-compatible — only the shard-import side was missing. Now the wasm `SessionShardInterface` keeps a bounded LRU `VecDeque` of fetched shards (cap 32, ~few MB of metadata), parsed in-process via the new `MDBInMemoryShard::from_reader` helper. The cache is kept separate from `session_shard` so server-side shard metadata is never re-uploaded via `upload_and_register_session_shards`. `chunk_hash_dedup_query` walks the session shard first, then the LRU; cache hits bump the matched shard to most-recently-used and report `already_uploaded=true` so callers skip the xorb upload. The unused `#[allow(dead_code)]` on `ctx` is dropped — it's now read in `query_dedup_shard_by_chunk` for `config.data.default_prefix`.

…dmines

Replace the placeholder upload smoke (tokenExpiry=0 sentinel, no CAS round-trip) with end-to-end variants that mint a real xet-write-token from the Hub for xet-team/xet-wasm-test and exercise commit() (xorb + shard push to CAS). Cover bytes vs stream and single vs multi-file (3 concurrent uploads per commit) so the per-file aggregation and the xorb-task fan-out are both on the smoke surface.

Bugs the strengthened smokes surfaced and this commit also fixes:
- xet_data/progress_tracking/upload_tracking.rs:391 used bare tokio::spawn, which panics on wasm ("no reactor running"). Cfg-gate `use tokio_with_wasm::alias as tokio` so the same call resolves to the wasm-compatible shim, matching the pattern in adjacent files.
- xet_data/processing/file_upload_session::finalize_impl took deduplication_metrics before joining xorb_upload_tasks. The spawned tasks update session.deduplication_metrics only after their CAS request resolves, so a take() before join drops their writes into the empty replacement. On native multithreaded the tasks usually completed during the prior .await; on wasm (single-threaded) they were still in flight, surfacing as xorb_bytes_uploaded == 0 in the smoke. Reorder: join first, then take.

API change in the wasm wrapper:
- validate_session_inputs now rejects tokenExpiry <= 0 instead of rewriting 0 to u64::MAX. Callers must pass the real exp from the Hub response; the sentinel only existed to keep the placeholder smoke alive and is no longer needed.

CI: four new non-blocking steps in build_and_test-wasm covering upload (bytes/stream) x (single/multi-file), all sourcing HF_SMOKE_TEST_TOKEN || HF_TOKEN.

Untracks .vscode/settings.json (previously shipped with a default rust formatter pin) so local editor state no longer leaks into git status.

- xorb_block: skip chunk-cache read/write on wasm. Wasm builds have no disk-backed ChunkCache, so the cache_key construction (and the put spawn) are unreachable; cfg-gate them out so the import of Key is also wasm-only and the chunk_cache arg becomes #[allow(unused)] under wasm.
- xet_runtime/utils/mod.rs + file_paths_wasm.rs: extract the wasm TemplatedPathBuf stub to its own file. mod.rs now just re-exports from file_paths or file_paths_wasm depending on target; cleaner than inlining the stub.
- xet_runtime/utils/rw_task_lock: add the same cfg-gated `use tokio_with_wasm::alias as tokio;` shim used elsewhere, so the tokio::spawn / JoinHandle types resolve correctly on wasm.

Adds focused smoke pages + runners for surfaces the existing smokes didn't reach. Each runs against xet-team/xet-wasm-test (except invalid-inputs, which is local-only), sources HF_SMOKE_TEST_TOKEN || HF_TOKEN, and is wired into build_and_test-wasm:
- invalid-inputs: validates that validate_session_inputs rejects bad token, endpoint, and tokenExpiry inputs across both newUploadCommit and newDownloadStreamGroup. Blocking (no network); a regression here would silently weaken the validation surface.
- download-multi: two concurrent downloadStream calls from one XetDownloadStreamGroup against pytorch_model.bin + tf_model.h5 on the pinned tiny-random-bert commit. Size delta (540 KiB vs 26 MiB) makes any stream-fan-out buffer crossover unambiguous.
- dedup: 65 MiB payload uploaded twice as two files in one commit. The 65 MiB size forces an xorb cut between the two uploadBytes calls (MAX_XORB_BYTES is 64 MiB), so the second file's chunks dedup against the first xorb's entries in session_shard. Smaller payloads don't work: both files end up co-resident in current_session_data with the session_shard still empty.
- upload-tiny: 0-byte + 1-byte + 64 KiB files in one commit. Hits the empty-xorb suppression path and the no-chunks branch in the chunker, both common bug nurseries.
- upload-mixed: uploadBytes + uploadStream concurrently on the same XetUploadCommit, then commit(). Catches handle-tracking bugs where one path's metadata clobbers the other's before commit() aggregates them.
- upload-multi-commit: two sequential XetUploadCommits from one XetSession. Each commit constructs its own FileUploadSession; catches XetSession-level leaks (task runtime cleanup, locked state) that would manifest as the second newUploadCommit() hanging.

Note on global dedup: a deterministic test of the CAS-indexed dedup path needs a chunk hash satisfying `hash % 1024 == 0`, which random 1 MiB payloads only produce ~1.6% of the time. The session-shard branch of chunk_hash_dedup_query (exercised by the dedup smoke) is on the same surface and covers that regression class.
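The ~1.6% figure in the note above is consistent with a quick back-of-the-envelope calculation. The sketch below assumes roughly 16 chunks per 1 MiB payload (about 64 KiB per chunk, which is an assumption and not stated in this PR) and uniformly distributed chunk hashes.

```rust
/// Probability that at least one of `n_chunks` uniformly distributed
/// hashes satisfies `hash % modulus == 0`.
pub fn p_any_eligible(n_chunks: u32, modulus: u32) -> f64 {
    // Chance that a single chunk hash is NOT eligible.
    let p_miss = 1.0 - 1.0 / f64::from(modulus);
    // Complement of "every chunk misses".
    1.0 - p_miss.powi(n_chunks as i32)
}
```

Under those assumptions, `p_any_eligible(16, 1024)` evaluates to roughly 0.0155, which is why random payloads make a poor deterministic trigger for the global-dedup path.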

…drift

- server.mjs path-prefix check now uses a `path.sep` boundary so a sibling dir like `<root>-other/` is no longer accepted alongside `<root>/`.
- New `wasm/check-version-pins.sh` asserts that every place pinning wasm-bindgen / wasm-pack (workspace Cargo.toml, the two wasm-subcrate Cargo.tomls, build_wasm.sh's WASM_BINDGEN_VERSION, and the action's cargo install + cache key) agrees on a single version. Wired as the first step in the build-wasm action so a partial bump fails CI before any install or build runs.

…pile

# Conflicts:
# wasm/hf_xet_wasm/src/wasm_file_cleaner.rs
# xet_client/src/cas_client/remote_client.rs
# xet_data/src/processing/mod.rs

assafvayner

claude generated — inline notes documenting the non-gating behavioral changes in this PR (the "genuine logic changes"). These are the only changes that alter runtime behavior beyond cfg-gating; flagging them for reviewer attention.

assafvayner · 2026-06-08T22:34:32Z

            result??;
        }

+        let mut metrics = take(&mut *self.deduplication_metrics.lock().await);


claude generated — Correctness fix: take(deduplication_metrics) was moved to run after the xorb_upload_tasks join loop above. Each xorb-upload task records its transmitted-byte count into deduplication_metrics only after its CAS request resolves, so taking the metric before the join snapshots an empty DeduplicationMetrics and silently drops those late writes. On native (multi-threaded) the tasks usually finish before this point, masking the race; on single-threaded wasm it reliably surfaced as xorb_bytes_uploaded == 0. This also closes a latent native race — and is safe because no new upload tasks are queued after the join.
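The ordering issue in the note above can be reproduced with a single-threaded model. This is a simplified sketch, not the finalize_impl code: deferred closures stand in for xorb-upload tasks whose CAS requests have not yet resolved, "joining" means running them, and all names other than DeduplicationMetrics and xorb_bytes_uploaded are hypothetical.

```rust
use std::cell::RefCell;
use std::mem::take;
use std::rc::Rc;

#[derive(Default, Debug, PartialEq)]
pub struct DeduplicationMetrics {
    pub xorb_bytes_uploaded: u64,
}

/// A deferred upload task; running it models "the CAS request resolved".
type Task = Box<dyn FnOnce()>;

fn queue_uploads(metrics: &Rc<RefCell<DeduplicationMetrics>>, sizes: &[u64]) -> Vec<Task> {
    sizes
        .iter()
        .map(|&n| {
            let m = Rc::clone(metrics);
            Box::new(move || m.borrow_mut().xorb_bytes_uploaded += n) as Task
        })
        .collect()
}

/// Old order: snapshot the metrics, then join. Late writes land in the
/// empty replacement and are lost from the snapshot.
pub fn finalize_take_then_join(sizes: &[u64]) -> u64 {
    let metrics = Rc::new(RefCell::new(DeduplicationMetrics::default()));
    let tasks = queue_uploads(&metrics, sizes);
    let snapshot = take(&mut *metrics.borrow_mut());
    for task in tasks {
        task();
    }
    snapshot.xorb_bytes_uploaded
}

/// Fixed order: join first, then snapshot.
pub fn finalize_join_then_take(sizes: &[u64]) -> u64 {
    let metrics = Rc::new(RefCell::new(DeduplicationMetrics::default()));
    let tasks = queue_uploads(&metrics, sizes);
    for task in tasks {
        task();
    }
    let snapshot = take(&mut *metrics.borrow_mut());
    snapshot.xorb_bytes_uploaded
}
```

In this model the old order always reports 0 bytes, matching the single-threaded wasm symptom described in the note; on a multi-threaded runtime the same bug would only appear intermittently.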

assafvayner · 2026-06-08T22:34:32Z

            .xorb
            .simulation_max_bytes
-            .map(|bs| (bs.as_u64() as usize).min(*MAX_XORB_BYTES))
+            .map(|bs| bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize)


claude generated — Correctness fix (32-bit truncation): the order was swapped from (bs.as_u64() as usize).min(*MAX_XORB_BYTES) to bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize. The old form cast a potentially huge u64 to usize before clamping, which truncates on wasm32's 32-bit usize. Clamping in u64 space first guarantees the result fits (MAX_XORB_BYTES = 64 MiB). Same change applied at file_upload_session.rs:442.
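The effect of the cast order can be seen on any host by letting u32 stand in for wasm32's 32-bit usize. This is an illustrative sketch; the function names are hypothetical and MAX_XORB_BYTES is hard-coded to the 64 MiB stated in the note.

```rust
/// 64 MiB, as stated in the review note above.
const MAX_XORB_BYTES: u64 = 64 * 1024 * 1024;

/// Old order (cast, then clamp), with `u32` modelling a 32-bit `usize`.
/// A value above `u32::MAX` wraps before the clamp ever sees it.
pub fn clamp_cast_first(bs: u64) -> u32 {
    (bs as u32).min(MAX_XORB_BYTES as u32)
}

/// New order (clamp in `u64` space, then cast). The clamped value is at
/// most 64 MiB, so the narrowing cast cannot truncate.
pub fn clamp_then_cast(bs: u64) -> u32 {
    bs.min(MAX_XORB_BYTES) as u32
}
```

For example, a configured value of 2^32 + 5 bytes becomes 5 under the old order but 64 MiB under the new one; small values are unaffected by either order.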

assafvayner · 2026-06-08T22:34:32Z

                .xorb
                .simulation_max_bytes
-                .map(|bs| (bs.as_u64() as usize).min(*MAX_XORB_BYTES))
+                .map(|bs| bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize)


claude generated — Correctness fix (32-bit truncation): same fix as file_deduplication.rs:99 — clamp simulation_max_bytes against MAX_XORB_BYTES in u64 space before the as usize cast, so wasm32's 32-bit usize cannot truncate a large configured value.

assafvayner · 2026-06-08T22:34:32Z

+                            // pass the raw Bytes directly (CONTENT_LENGTH is set by reqwest from the body length).
+                            #[cfg(target_family = "wasm")]
+                            {
+                                client.post(url).with_extension(Api(api_tag)).body(serialized_data)


claude generated — Behavior change: native and wasm upload_xorb are now a single code path. The previous separate wasm impl posted the body directly and skipped both the connection permit and RetryWrapper/adaptive-concurrency. wasm now flows through the same retry + permit logic; it differs only in the request body (raw Bytes, since reqwest's wasm backend can't stream request bodies — CONTENT_LENGTH is set by reqwest from the body) and emits a single bulk report_progress(n_transfer_bytes) after success instead of per-chunk streaming updates. Net improvement, but note the wasm upload path now newly exercises the retry/adaptive-concurrency machinery.

assafvayner · 2026-06-08T22:34:32Z

+    /// hits and `true` for dedup-cache hits, since cache entries come from
+    /// shards the server already has. Cache hits also bump the matched
+    /// shard to the most-recently-used position.
+    pub async fn chunk_hash_dedup_query(


claude generated — Correctness note: the (usize, FileDataSequenceEntry, bool) "already uploaded" contract matches native (shard_interface/native.rs::chunk_hash_dedup_query): session-shard hit → false, global-dedup-cache hit → true. wasm correctly omits native's third "resumed session" branch (no resume support on wasm) and mirrors only the two branches that apply. A miss here costs only a re-upload (never corruption), so the in-memory + bounded-LRU divergence from native's on-disk shard managers is low-risk for correctness.

Add job-level continue-on-error to build_and_test-wasm so a wasm failure no longer fails the overall CI run or gates merges/releases. The job still runs the full wasm suite (version-pin check, wasm builds, the cargo check -p hf-xet compile gate, and all 11 browser smokes) on every PR and main push; genuine regressions surface as a red "Build WASM" job (visible signal) while the run stays green. The per-step continue-on-error on the network smokes keeps the job green for hub/CAS blips, so a red job means a real wasm regression.

…pile

…arness

- dedupe TaskRuntime::bridge_async{,_finalizing} across targets via a MaybeSend shim; only run_inner_async and bridge_sync* stay cfg-gated
- collapse the duplicated wasm variants of XetSessionBuilder::build and TranslatorConfig::new into single fns with a cfg'd binding
- wasm shard dedup cache: drop LRU recency machinery for a bounded FIFO
- fold system_monitor into all_config_groups! now that its wasm gate is gone
- replace XetConfig::validate_usize_bounds with usize::try_from at the ingestion_block_size cast sites
- unify wasm cfg predicate in async_read.rs (target_arch -> target_family)
- consolidate wasm/ci-smoke into a shared harness (run.mjs scenario table + harness.html + common.mjs + scenarios/*.mjs), replacing 22 near-duplicate runner/page files; ci.yml steps and blocking semantics unchanged
- drop wasm/check-version-pins.sh and its build-wasm action step
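A bounded FIFO of the kind this commit describes for the wasm shard dedup cache can be sketched with a VecDeque. This is a generic illustration, not the SessionShardInterface code; the type and method names are hypothetical.

```rust
use std::collections::VecDeque;

/// Fixed-capacity FIFO: when full, the oldest entry is evicted.
/// Lookups do not reorder entries, unlike an LRU.
pub struct BoundedFifo<T> {
    cap: usize,
    entries: VecDeque<T>,
}

impl<T> BoundedFifo<T> {
    pub fn new(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be non-zero");
        Self { cap, entries: VecDeque::with_capacity(cap) }
    }

    /// Inserts `item`, returning the evicted oldest entry if the cache was full.
    pub fn push(&mut self, item: T) -> Option<T> {
        let evicted = if self.entries.len() == self.cap {
            self.entries.pop_front()
        } else {
            None
        };
        self.entries.push_back(item);
        evicted
    }

    /// Returns the first (oldest) entry matching `pred`, without touching order.
    pub fn find<P>(&self, pred: P) -> Option<&T>
    where
        P: Fn(&T) -> bool,
    {
        self.entries.iter().find(|e| pred(*e))
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Compared with an LRU, a hit costs no bookkeeping; the trade-off is that a frequently used shard is evicted on the same schedule as any other. An earlier commit in this PR mentions a cap of 32 shards.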

Production CAS returns HMAC-keyed shards from query_for_global_dedup_shard: the chunk hashes in the shard are keyed with a per-shard key stored in the footer, which MDBInMemoryShard::from_reader skips. The wasm dedup cache was probing keyed lookup tables with raw chunk hashes, so global dedup silently never matched against prod (while working in unkeyed local simulation). Cache entries now carry the shard's MDBShardInfo (header + footer): probe hashes are keyed with chunk_hmac_key() per shard, mirroring the native shard_file_manager, and entries past shard_key_expiry are skipped (0 = no expiry, web_time for the wasm clock). Adds a native unit test building a keyed shard via export_as_keyed_shard_streaming and asserting raw probes miss while keyed probes hit — the only CI-visible guard, since simulation shards are unkeyed.
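The failure mode described above (raw probes against keyed tables) and the expiry rule can be illustrated with a toy model. The keying function here is a plain XOR used only as a stand-in; it is not HMAC and not the real chunk_hmac_key() logic, and all type and function names are hypothetical.

```rust
use std::collections::HashSet;

/// Toy stand-in for HMAC keying. NOT cryptographic; it only shows that
/// stored hashes and probe hashes must be transformed with the same key.
fn toy_keyed(hash: u64, key: u64) -> u64 {
    hash ^ key
}

pub struct CachedShard {
    key: u64,
    /// Seconds since epoch; 0 means "no expiry", as in the commit message.
    key_expiry: u64,
    keyed_chunks: HashSet<u64>,
}

impl CachedShard {
    pub fn new(key: u64, key_expiry: u64, raw_hashes: &[u64]) -> Self {
        let keyed_chunks = raw_hashes.iter().map(|&h| toy_keyed(h, key)).collect();
        Self { key, key_expiry, keyed_chunks }
    }

    /// Correct probe: skip expired shards, then key the hash per shard.
    pub fn contains(&self, raw_hash: u64, now: u64) -> bool {
        if self.key_expiry != 0 && now > self.key_expiry {
            return false;
        }
        self.keyed_chunks.contains(&toy_keyed(raw_hash, self.key))
    }

    /// Buggy probe: looks up the raw hash in a keyed table.
    pub fn contains_raw_probe(&self, raw_hash: u64) -> bool {
        self.keyed_chunks.contains(&raw_hash)
    }
}
```

In this model a raw probe never matches a keyed entry, which mirrors why global dedup silently missed against prod while unkeyed simulation shards kept working.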

… xorbs) Downloads the pre-seeded deterministic 16 MiB file from xet-team/xet-wasm-test and re-uploads its bytes in a fresh commit, asserting the chunk-0 global-dedup query fully dedups the re-upload: new_bytes == 0, xorb_bytes_uploaded == 0, shard still pushed. This is the end-to-end regression guard for the HMAC-keyed shard lookup — if the keying regresses, deduped_bytes_by_global_dedup drops to 0 and a full payload of xorbs is pushed. Deterministic because prod CAS indexes the first chunk of every uploaded file and the client always queries chunk 0. First run against an unseeded repo bootstraps: uploads the xorshift32 seed payload and commits it via the Hub NDJSON commit API so paths-info sees it and GC keeps its xorbs. Verified against prod: bootstrap + assert runs pass; with the HMAC fix reverted the scenario fails with the expected signature. Non-blocking CI step like the other network smokes.

…as-casts

Scope cleanup of 65ab8ab: keep pre-existing main code untouched where the change had no wasm effect.

- xet_runtime config (groups/mod.rs, macros.rs, xet_config.rs): back to main's explicit per-consumer system_monitor appends with their cfg(not(wasm)) gates; the group is absent from the config on wasm as on main. Also drops the PR's validate_usize_bounds (net-zero vs main).
- data_client.rs, file_upload_session.rs: restore main's `as usize` casts of ingestion_block_size in the four cfg(not(wasm))-only fns (clean_file, hash_files_async, upload_files, feed_file_to_cleaner); the try_from guard is meaningless on 64-bit-only code. The wasm-compiled site in file_cleaner.rs keeps the guard.
- api_changes doc updated to match.
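To make the `as usize` versus `try_from` trade-off concrete, here is a small sketch. It assumes the configured block size is a `u64`, which the commit message does not state, and both helper names are hypothetical. `u32` is used as the narrow type to model a 32-bit `usize` such as the one on wasm32.

```rust
use std::num::TryFromIntError;

/// Hypothetical guard: convert a configured block size to `usize`,
/// returning an error instead of silently truncating on narrow targets.
fn block_size_as_usize(ingestion_block_size: u64) -> Result<usize, TryFromIntError> {
    usize::try_from(ingestion_block_size)
}

/// Models a 32-bit `usize` with `u32` so the difference is visible on any
/// host: `as` wraps modulo 2^32, while `try_from` reports the overflow.
fn narrow_cast_demo(value: u64) -> (u32, Result<u32, TryFromIntError>) {
    (value as u32, u32::try_from(value))
}
```

On a 64-bit host the guard can never fail for a `u64` input, which is why the commit restores plain casts in the native-only functions and keeps the guard only at the wasm-compiled site.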

…pload lifecycle, sha256 policy)

- download-range: prefix/mid/suffix byte-range downloads vs reference slices
- download-cancel: cancel() mid-stream, group must stay usable
- download-error: nonexistent hash + malformed fileInfo must reject, not hang
- upload-lifecycle: post-abort/post-commit misuse must reject (wasm mirror of native upload_commit state-machine tests)
- sha256-policy: compute/provided/skip metadata + parse_sha256_policy rejects
- download-multi: verify per-file content sha256 against pinned values

Also make smoke failures visible in CI: drop step-level continue-on-error (the job-level flag still keeps the overall run green and gates nothing), and absorb hub/CAS blips with a single in-runner retry in run.mjs instead.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 585d975.}

cursor · 2026-06-11T22:17:43Z

+          `dedup_metrics.xorb_bytes_uploaded=${dedup.xorb_bytes_uploaded}, expected < 1.5 * ${EXPECTED_PAYLOAD_SIZE} ` +
+            `(only one payload's worth of bytes should hit CAS): ${JSON.stringify(dedup)}`,
+        );
+      }


Dedup smoke misses zero xorb

Low Severity

The dedup smoke only upper-bounds commitReport.dedup_metrics.xorb_bytes_uploaded and never requires a positive upload count. When xorb_bytes_uploaded is zero (e.g. metrics taken before xorb upload tasks finish), the check still passes, so the scenario can miss the regression that this table's comments are meant to guard against.

seanses

first batch of comments

seanses · 2026-06-11T18:06:57Z

+          ~/.cargo/bin/wasm-pack
+          ~/.cargo/.crates.toml
+          ~/.cargo/.crates2.json
+        key: ${{ runner.os }}-${{ runner.arch }}-cargo-tools-wbg-0.2.121-wp-0.14.0-rustc-${{ steps.rustc-version.outputs.version }}


"wasm-bindgen-cli" and "wasm-pack" are stable native host binaries that don't depend on nightly-only features. So embedding the rustc nightly version in the cache key is useless: compilation produces the same artifact as long as the binary release version is identical (e.g. 0.2.121).

seanses · 2026-06-11T18:11:46Z

+    - name: Check hf-xet compiles for wasm32-unknown-unknown
+      shell: bash
+      run: |
+        CARGO_TARGET_WASM32_UNKNOWN_UNKNOWN_RUSTFLAGS="-C target-feature=+atomics,+bulk-memory,+mutable-globals --cfg getrandom_backend=\"wasm_js\"" \


Let's similarly put the build details in a "build_wasm.sh" under the "hf-xet" package (xet_pkg). Also, compared to "wasm/hf_xet_wasm/build_wasm.sh", a lot of RUSTFLAGS seem to be missing here.

seanses · 2026-06-11T18:12:37Z

          hf_xet/target
-          wasm/hf_xet_wasm/target
          wasm/hf_xet_thin_wasm/target
+          wasm/hf_xet_wasm/target


Seems like a meaningless reorder, but alright.

seanses · 2026-06-11T18:18:04Z

+    # Non-blocking: this job runs the full wasm test suite (version-pin check,
+    # wasm builds, the `cargo check -p hf-xet` compile gate, and all browser
+    # smoke tests) on every PR and main push, but a failure here must never
+    # fail the CI run or gate merges/releases. wasm is an additive target, so
+    # regressions surface as a red "Build WASM" job (visible signal) without
+    # turning the overall run red. Nothing `needs:` this job and the release
+    # workflows are independent, so it blocks no other jobs or releases either.
+    # (The per-step `continue-on-error` on the network smokes below keeps the
+    # job green for hub/CAS blips, so a red job means a real wasm regression.)
+    continue-on-error: true


This contradicts our expectation: if we ship a WASM-compatible hf-xet, we don't want to break it in any version.

seanses · 2026-06-11T18:21:04Z

          test -z "$(git status --porcelain Cargo.lock)" || (echo "hf_xet_wasm Cargo.lock has uncommitted changes!" && exit 1)
+      - uses: actions/setup-node@a0853c24544627f65ddf259abe73b1d18a591444  # v5.0.0
+        with:
+          node-version: '20'


Use "24". GitHub Actions is phasing out "20", so we should too.

seanses · 2026-06-11T18:36:54Z

+The previous separate `wasm/hf_xet_wasm_download` and `wasm/hf_xet_wasm_upload`
+crates have been combined back into `wasm/hf_xet_wasm`.


Looks like this describes intermediate work in this PR; ask the AI to trim such statements.

seanses · 2026-06-12T08:09:51Z

+files from disk, and `hash_files_async` additionally routes through
+`XetRuntime::spawn_blocking` (which on wasm runs `f` inline anyway —
+unsuitable for the parallel-hash use case). Wasm consumers that need
+hashing must drive it from JS (e.g. `crypto.subtle.digest`) or feed


What is crypto.subtle.digest? That's not the Xet file hash.

seanses · 2026-06-12T08:15:31Z


 [lib]
-crate-type = ["cdylib", "rlib"]
+crate-type = ["cdylib"]


What's the reasoning behind removing "rlib"?

seanses · 2026-06-12T08:20:33Z


 #[cfg(not(target_family = "wasm"))]
 mod file_paths;
+#[cfg(target_family = "wasm")]


Instead of this, I'm thinking of completely cfg-gating the XetConfig values that are of type "TemplatedPathBuf", so we don't need to provide a dummy "TemplatedPathBuf" for wasm that is never used.
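A minimal sketch of what gating the field itself could look like, as opposed to supplying a dummy type on wasm. The `LogConfig` struct, its fields, and the default path are hypothetical and use `PathBuf` in place of "TemplatedPathBuf"; the real XetConfig groups are generated by macros and will differ.

```rust
use std::path::PathBuf;

/// Hypothetical config group with one path-typed value.
pub struct LogConfig {
    pub level: String,
    // The path-typed field exists only on non-wasm targets, so no
    // placeholder path type has to be defined for wasm.
    #[cfg(not(target_family = "wasm"))]
    pub dest: PathBuf,
}

impl LogConfig {
    pub fn new(level: &str) -> Self {
        Self {
            level: level.to_string(),
            // Field initializers can carry the same cfg as the field.
            #[cfg(not(target_family = "wasm"))]
            dest: PathBuf::from("/tmp/xet.log"),
        }
    }
}
```

The cost of this approach is that every read of the gated field must sit behind the same cfg, which is the trade-off against keeping a stub type.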

seanses · 2026-06-12T08:27:57Z

 ] }
+tokio_with_wasm = { workspace = true, features = ["rt", "sync", "time"] }
+web-time = { workspace = true }
+wasm-bindgen = { workspace = true }


Does "xet_runtime" need "wasm-bindgen", "wasm-bindgen-futures" and "js-sys"? I thought they were used to generate JS bindings, i.e. in "wasm/*".

seanses

Regarding XetRuntime

seanses · 2026-06-15T11:45:14Z

-        .enabled
-        .then(|| {
-            SystemMonitor::follow_process(config.system_monitor.sample_interval, config.system_monitor.log_path.clone())
-                .ok()


If we want to inspect the error, we can just add .inspect_err(|e| debug!(...)) before .ok(), instead of rewriting the function.
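A sketch of the suggested shape, assuming a fallible constructor whose error should be logged before being discarded. `start_monitor` and `maybe_start_monitor` are hypothetical stand-ins for the real SystemMonitor code, and `eprintln!` replaces `debug!` so the example needs no logging crate.

```rust
use std::num::ParseIntError;

/// Hypothetical fallible constructor standing in for
/// `SystemMonitor::follow_process`.
fn start_monitor(interval_ms: &str) -> Result<u64, ParseIntError> {
    interval_ms.parse::<u64>()
}

/// Keeps the original `.then(|| ... .ok())` shape and only adds a log hook.
fn maybe_start_monitor(enabled: bool, interval_ms: &str) -> Option<u64> {
    enabled
        .then(|| {
            start_monitor(interval_ms)
                // Observe the error without consuming or changing the Result.
                .inspect_err(|e| eprintln!("system monitor failed to start: {e}"))
                .ok()
        })
        .flatten()
}
```

`Result::inspect_err` passes the value through unchanged, so the surrounding chain keeps its original type and control flow.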

seanses · 2026-06-15T15:22:26Z

@@ -3,15 +3,18 @@ use std::collections::HashMap;
 use std::fmt::Display;
 use std::future::Future;
 use std::panic::AssertUnwindSafe;


This file contains so much cfg-gated code that it's becoming difficult to read, and it still contains many code paths that won't be used by WASM at all (e.g. all member fields of XetRuntime). A XetRuntime for the WASM target can be extremely simple. How about splitting the WASM XetRuntime into a separate file, and keeping this one only for the native target: #873

assafvayner mentioned this pull request May 13, 2026

feat(wasm): hf-hub compiles for wasm32-unknown-unknown huggingface/hf-hub#162

Draft

10 tasks

assafvayner added 26 commits May 14, 2026 13:56

docs: add design spec for xet_pkg WASM download support (issue #840)

a6b5050

docs: add implementation plan for xet_pkg WASM download support

5f4b32c

feat(wasm): split tokio dependency into target-specific blocks in xet…

eeefdb4

…_pkg

feat(wasm): add WASM-compatible bridge variants to TaskRuntime

36c43fe

feat(wasm): gate upload/file-download/sigint/handle methods in XetSes…

8bf052f

…sionBuilder

feat(wasm): gate upload and file-download modules from WASM compilation

b1f1920

feat(wasm): gate blocking download methods from WASM compilation

78dc3f5

ci: add hf-xet WASM compile check to build-wasm action

d356993

feat(wasm): gate DiskCache and cache_manager on non-WASM in xet_client

b3bef50

style: apply rustfmt

e55b9cd

feat(wasm): add tokio_with_wasm dependency for xet_data

58bdad9

feat(wasm): add conditional ?Send to URLProvider/DataWriter traits an…

3be456a

…d impls

feat(wasm): swap tokio::spawn/JoinSet for tokio_with_wasm aliases in …

2cfea6c

…download path

feat(wasm): gate SyncWriterThread, SequentialWriter::new, blocking_ne…

db46503

…xt, cache-write on non-WASM

feat(wasm): gate filesystem-using methods of FileReconstructor and Fi…

6f886f6

…leDownloadSession

feat(wasm): un-gate file_reconstruction/processing in xet_data, expos…

7d4e232

…e ClosureGuard on WASM

feat(wasm): un-gate xet_pkg download stream API now that xet_data sup…

5a6e311

…ports WASM

feat(wasm): add hf_xet_wasm_download JS-binding crate

587833a

style: apply rustfmt to wasm refactor

2d668d3

chore: drop unused bytes dep, ignore false-positive machete flags

cf0a39f

chore: refresh hf_xet/Cargo.lock for new tokio_with_wasm transitive dep

cb8b330

assafvayner force-pushed the feat/wasm-xet-pkg-compile branch from 97f9ebf to 989e522 Compare May 14, 2026 20:56

assafvayner added 2 commits May 14, 2026 14:17

update

ee38456

assafvayner added 3 commits May 18, 2026 17:47

assafvayner marked this pull request as ready for review May 19, 2026 17:57

cursor Bot reviewed May 19, 2026

Comment thread wasm/ci-smoke/server.mjs

Comment thread wasm/hf_xet_wasm/src/common.rs

assafvayner added 3 commits May 19, 2026 11:01

docs(wasm): replace huggingface.js references with hf-hub

eede83e

hoytak reviewed May 19, 2026

Comment thread xet_client/src/cas_client/remote_client.rs

hoytak reviewed May 19, 2026

Comment thread xet_core_structures/src/utils/exp_weighted_moving_avg.rs

assafvayner added 8 commits May 19, 2026 16:03

refactor(wasm): nest cfg arms inside single let-request block

28531cf

Per review feedback: collapse the two cfg-gated `let request = ...` bindings into one expression so the native and wasm alternatives are visually paired.

chore: ignore .vscode and .zed editor configs

46122fd

Untracks .vscode/settings.json (previously shipped with a default rust formatter pin) so local editor state no longer leaks into git status.

Merge remote-tracking branch 'origin/main' into feat/wasm-xet-pkg-com…

73cc213

…pile

# Conflicts:
# wasm/hf_xet_wasm/src/wasm_file_cleaner.rs
# xet_client/src/cas_client/remote_client.rs
# xet_data/src/processing/mod.rs

assafvayner commented Jun 8, 2026

assafvayner added 3 commits June 8, 2026 16:20

Merge remote-tracking branch 'origin/main' into feat/wasm-xet-pkg-com…

40381e5

…pile

cursor Bot reviewed Jun 9, 2026

Comment thread wasm/hf_xet_wasm/examples/download.html

Comment thread wasm/hf_xet_wasm/src/download_group.rs

assafvayner added 4 commits June 9, 2026 16:39

cursor Bot reviewed Jun 11, 2026

seanses reviewed Jun 12, 2026

seanses reviewed Jun 15, 2026

Conversation

assafvayner commented May 12, 2026 •

edited by cursor Bot