feat(wasm): make hf-xet compile for wasm32-unknown-unknown #841

Open
assafvayner wants to merge 79 commits into main from feat/wasm-xet-pkg-compile

@assafvayner assafvayner commented May 12, 2026

Contributor

Closes #840

Summary

Makes hf-xet (xet_pkg) and its streaming download + upload data-prep paths compile and run on wasm32-unknown-unknown, and reworks wasm/hf_xet_wasm/ into a single unified example/smoke-test JS-binding crate (cdylib) that exposes both the download and upload flows from one wasm module. The earlier split into separate hf_xet_wasm_download/ and hf_xet_wasm_upload/ crates was reverted (see ff0c46b8) — one wasm module is enough for the CI smokes and easier for the example pages to share.

The existing thin published crate wasm/hf_xet_thin_wasm/ is unchanged in scope but follows the new shared wasm-bindgen pinning.

Two layers of changes

1. Compile compatibility (commits 12274c0 through 7e3c692): cfg-gated tokio dep split, task_runtime.rs WASM bridge variants, xet_pkg::legacy and _blocking/path-based methods gated to non-wasm, cache gating in xet_client.

2. Streaming runtime compatibility (commits 5b7c26e through 28b9265): make the entire streaming download path !Send-tolerant so it actually executes on WASM (where reqwest's futures are !Send).

A follow-up simplification pass (65ab8aba) deduplicates the wasm/native split where the bodies were identical: TaskRuntime::bridge_async{,_finalizing} are written once via a MaybeSend shim, XetSessionBuilder::build and TranslatorConfig::new are single fns with a cfg'd binding, the wasm shard dedup cache is a plain bounded FIFO, and the ci-smoke suite runs from one shared harness.
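
A rough sketch of the MaybeSend idea described above, for readers unfamiliar with the pattern. This is not the PR's code: the blanket impls and the helper accepts_bridged are assumptions made for illustration. Only the shape (a Send supertrait on native, no bound on wasm) comes from the description.

```rust
// Native targets: MaybeSend is just Send under another name.
#[cfg(not(target_family = "wasm"))]
pub trait MaybeSend: Send {}
#[cfg(not(target_family = "wasm"))]
impl<T: Send> MaybeSend for T {}

// wasm targets: no bound at all, so !Send futures are accepted.
#[cfg(target_family = "wasm")]
pub trait MaybeSend {}
#[cfg(target_family = "wasm")]
impl<T> MaybeSend for T {}

/// Hypothetical helper standing in for a bridge function: a single
/// signature compiles on both targets because the bound changes with cfg.
pub fn accepts_bridged<T: MaybeSend>(value: T) -> T {
    value
}
```

With a shim like this, a function body only has to be written once; the Send requirement is decided by the target rather than by duplicating the function under two cfg attributes.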

Key changes (runtime compatibility refactor)

  • tokio_with_wasm dependency: added to [workspace.dependencies] and to xet_pkg/xet_data/xet_runtime/xet_client wasm-target deps. On native, call sites use real tokio; on wasm, use tokio_with_wasm::alias as tokio redirects spawn/JoinSet/select!/sync to the wasm-bridged variants that use wasm_bindgen_futures::spawn_local internally, so spawned futures don't need Send.
  • Conditional ?Send async_trait: URLProvider (xet_client), DataWriter (xet_data), XorbURLProvider, SequentialWriter, UnorderedWriter, UploadSessionDataManager impls. Trait bounds on dyn T stay Send + Sync; only the futures lose Send on wasm.
  • tokio_with_wasm::alias as tokio shims: every spawn site in the streaming path (download_stream.rs, unordered_download_stream.rs, sequential_writer.rs, unordered_writer.rs, file_term.rs, manager.rs, xet_pkg/upload_*, xet_data/file_upload_session, etc.) gets the wasm-only import.
  • Internal gating: SyncWriterThread, SequentialWriter::new (disk), DownloadStream::blocking_next, UnorderedDownloadStream::blocking_next, the chunk-cache tokio::spawn in xorb_block.rs, every _blocking method on session types, XetSession::sigint_abort, XetSessionBuilder::with_tokio_handle, and the entire xet_pkg::legacy module are all #[cfg(not(target_family = "wasm"))].
  • Shared task bridging: TaskRuntime::bridge_async and bridge_async_finalizing (the task-state machine) exist once for both targets via a MaybeSend shim trait (Send supertrait on native, unconstrained on wasm); only run_inner_async (XetRuntime offload vs inline select) and the native-only bridge_sync* pair are cfg-gated.
  • xet_runtime adjustments: un-gated ClosureGuard (pure Rust). Moved tracing-appender to non-wasm deps and gated the init submodule. XetRuntime::new has a wasm stub (no tokio runtime created — bridge variants in task_runtime.rs .await directly via wasm_bindgen_futures). XetRuntime::spawn_blocking cfg-gates to tokio_with_wasm::task::spawn_blocking on wasm. XetRuntime::{handle, num_worker_threads, spawn, bridge_async, bridge_sync, external_run_async_task} gated to native — wasm callers route through task_runtime.rs bridge variants instead, and gating surfaces this as a compile-time error rather than a runtime panic. Dropped the wasm SystemMonitor and gated the sysinfo-based native one off on wasm. The system_monitor config group keeps its pre-existing native-only gating (absent from the config on wasm).
  • Filesystem methods gated: FileReconstructor::{reconstruct_to_file, reconstruct_to_writer}, FileDownloadSession::{download_file, download_file_with_id, download_file_background, download_to_writer}, FileUploadSession::{upload_files, spawn_upload_from_path, feed_file_to_cleaner}, XetUploadCommit::upload_from_path.
  • xet_data::processing and xet_data::file_reconstruction un-gated. FileDownloadSession, FileUploadSession, XetFileInfo, Sha256Policy, DownloadStream, UnorderedDownloadStream, ChunkCache, CacheConfig, create_remote_client are now available on wasm. TranslatorConfig::new is a single fn that skips filesystem directory creation on wasm. New shard_interface/wasm.rs provides an in-memory MDBInMemoryShard-backed SessionShardInterface with a bounded FIFO dedup cache (no disk staging, no resume).
  • xet_pkg cleanup: reverted the over-aggressive gates from earlier in this PR. XetDownloadStreamGroup, XetDownloadStream, XetSession::new_download_stream_group, XetSession::new_upload_commit, XetUploadCommit::{upload_bytes, upload_stream, commit, abort}, XetStreamUpload, XetFileUpload, XetFileInfo, Sha256Policy, and FileReconstructionError are all on wasm.
  • wasm/hf_xet_wasm/ (unified example crate, cdylib): wraps xet::xet_session::XetSession with #[wasm_bindgen] and exposes both the upload and download flows from one wasm module. The JS surface mirrors the Rust builder pattern — XetSession is auth-free; auth lives on the per-commit / per-group builder.
    • Download: new XetSession() + XetSession.newDownloadStreamGroup(endpoint, token, tokenExpiry) -> Promise<XetDownloadStreamGroup> + XetDownloadStreamGroup.downloadStream(fileInfo, byteRangeStart?, byteRangeEnd?) -> Promise<XetDownloadStream> + XetDownloadStream.next() -> Promise<Uint8Array | undefined> + cancel().
    • Upload: XetSession.newUploadCommit(endpoint, token, tokenExpiry) -> Promise<XetUploadCommit> + XetUploadCommit.{uploadBytes, uploadStream, commit, abort} + XetStreamUpload.{write, finish, abort}.
    • build_wasm.sh + examples/download.html + examples/upload.html browser test pages. This is not a published browser SDK — real consumers should depend on hf-xet directly with their own #[wasm_bindgen] glue, or use a downstream SDK such as hf-hub.
  • CI: build step in .github/actions/build-wasm/action.yml, cargo +nightly check --target wasm32-unknown-unknown -p hf-xet compile gate, Cargo.lock freshness checks, plus a 12-scenario headless-Chromium Playwright smoke matrix in wasm/ci-smoke/ (see Testing). The smokes run from a single shared harness: node run.mjs <scenario> drives one harness.html that dynamically imports the per-scenario module from wasm/ci-smoke/scenarios/, with shared token/assert helpers in common.mjs — each scenario file holds only its scenario-specific logic.

WASM API surface (downloads)

XetSessionBuilder::build() → XetSession
XetSession::new_download_stream_group() → XetDownloadStreamGroupBuilder
  .with_endpoint(...)
  .with_token_info(token, expiry)
  .with_token_refresh_url(...)
  .build().await → XetDownloadStreamGroup
XetDownloadStreamGroup::download_stream(file_info, range).await → XetDownloadStream
XetDownloadStream::next().await → Option<Bytes>
XetDownloadStream::cancel()
XetSession::abort() / id() / config()

WASM API surface (uploads)

XetSessionBuilder::build() → XetSession
XetSession::new_upload_commit() → XetUploadCommitBuilder
  .with_endpoint(...)
  .with_token_info(token, expiry)
  .build().await → XetUploadCommit
XetUploadCommit::upload_bytes(bytes, sha256, name).await → XetFileUpload
XetUploadCommit::upload_stream(name, sha256).await → XetStreamUpload
XetStreamUpload::write(chunk).await / finish().await / abort()
XetUploadCommit::commit().await → XetCommitReport
XetUploadCommit::abort()

The unified hf_xet_wasm crate exposes the JS classes XetSession, XetDownloadStreamGroup, XetDownloadStream, XetUploadCommit, XetStreamUpload, and XetFileUpload from a single wasm module.

Testing

  • Native: all hf-xet (95 unit + 6 doctest) and xet-data (283 unit) tests pass
  • WASM compile: cargo +nightly check --target wasm32-unknown-unknown -p hf-xet passes (CI)
  • WASM build: wasm/hf_xet_wasm/build_wasm.sh and wasm/hf_xet_thin_wasm/build_wasm.sh produce pkg/{js,d.ts,wasm} (CI)
  • WASM browser smokes (CI, headless Chromium + Playwright, consolidated in wasm/ci-smoke/, each invoked as node run.mjs <scenario>):
    • invalid-inputs — local-only wasm-side input validation (blocking, no Hub or CAS).
    • download — single-file download from prod hub + CAS, asserts byte count + SHA-256 (non-blocking, hub blips don't fail PRs).
    • download-multi — concurrent multi-file download in one XetDownloadStreamGroup (non-blocking).
    • upload: uploadBytes + commit() (xorb + shard push to CAS); regression guard for the wasm XetRuntime::spawn_blocking panic (non-blocking).
    • upload-stream — streaming variant of upload (uploadStream + XetStreamUpload::{write, finish}) (non-blocking).
    • upload-multi — concurrent multi-file uploadBytes in one commit (non-blocking).
    • upload-stream-multi — concurrent multi-file uploadStream (non-blocking).
    • upload-mixed: uploadBytes + uploadStream concurrently in one commit (non-blocking).
    • upload-tiny — 0-byte / 1-byte / 64 KiB files in one commit; catches regressions to the empty-xorb / no-chunks path (non-blocking).
    • upload-multi-commit — two sequential XetUploadCommits from one XetSession; catches XetSession-level resource leaks (non-blocking).
    • dedup — session-level in-commit dedup: two identical 65 MiB files trigger a cross-xorb dedup hit (non-blocking).
    • global-dedup — downloads the pre-seeded deterministic 16 MiB file from the test repo and re-uploads it; asserts the HMAC-keyed global-dedup shard lookup fully dedups the re-upload (xorb_bytes_uploaded === 0, no new xorb) (non-blocking).
  • Manual browser tests: serve wasm/hf_xet_wasm/examples/download.html or wasm/hf_xet_wasm/examples/upload.html with a static server (the smoke server.mjs sets the required COOP/COEP headers).

Note

High Risk
Touches core async/I/O paths for a new target and runs prod Hub/CAS smokes; regressions affect browser consumers but the wasm job does not block merges (continue-on-error).

Overview
Adds wasm32-unknown-unknown support for the core hf-xet stack via workspace deps (tokio_with_wasm, pinned wasm-bindgen 0.2.121), lockfile updates, and documented wasm-only API/runtime patterns in README and api_changes/update_260515_wasm_target_support.md.

CI / build: .github/actions/build-wasm now caches wasm tool binaries (keyed on nightly rustc), bumps wasm-bindgen-cli / wasm-pack, runs cargo +nightly check -p hf-xet for wasm, and reorders rust cache paths. The Build WASM job is continue-on-error: true but runs a large Playwright matrix under wasm/ci-smoke/ (shared harness, many upload/download/dedup/lifecycle scenarios against prod Hub/CAS when tokens are set).

Example crates: wasm/hf_xet_wasm becomes a thin cdylib JS wrapper around hf-xet (not a published SDK); hf_xet_thin_wasm aligns bindgen versions and cdylib-only output. .gitignore drops checked-in VS Code settings and ignores editor/node_modules trees.

Reviewed by Cursor Bugbot for commit 585d975.

Gate imports and items that depend on non-WASM modules (xet_data::processing,
xet_data::file_reconstruction, xet_runtime::core::xet_cache_root) behind
#[cfg(not(target_family = "wasm"))]:

- error.rs: gate FileReconstructionError import, from_file_reconstruction_error_ref,
  DataError::FileReconstructionError match arm, and From<FileReconstructionError> impl
- lib.rs: gate init_logging (uses xet_cache_root which is non-WASM)
- legacy/mod.rs: gate data_client, progress_tracking mods and all re-exports from
  xet_data::processing
- xet_session/mod.rs: gate common, download_stream_group, download_stream_handle mods
  and their re-exports, and xet_data::processing re-exports
- xet_session/session.rs: gate download_stream_group imports, active_download_stream_groups
  field, new_download_stream_group, register_download_stream_group, and abort's stream
  group cleanup
The example now hardcodes a Qwen/Qwen-Image-Edit file (overridable in the form)
and uses two HF Hub REST endpoints to derive the XetSession inputs:

- POST /api/{repo_type}s/{namespace}/{repo}/paths-info/{rev} -> xetHash + size
- GET  /api/{repo_type}s/{namespace}/{repo}/xet-read-token/{rev} -> { accessToken, exp, casUrl }

These are the documented Xet protocol endpoints from
https://huggingface.co/docs/xet/file-id and /docs/xet/auth. paths-info is used
instead of the /resolve route because resolve returns a 302 we MUST NOT follow
and the X-Xet-Hash header is hard to read in a browser fetch on a redirect.
Headless browser test (puppeteer) of wasm/hf_xet_wasm_download against a
real HF CAS endpoint surfaced five runtime panics; all are fixed here.

1. XetSessionBuilder::build on wasm called tokio::runtime::Handle::current,
   which panics when invoked from a JS callback with no enclosing runtime.
   Switch to XetContext::with_config; the wasm path of XetRuntime::new now
   constructs a stub XetRuntime (the wasm bridge variants .await directly
   via wasm_bindgen_futures, so the inner tokio runtime is never driven).

2. xet_runtime::core::runtime called std::process::id() at multiple sites
   for fork detection; wasm32-unknown-unknown's libstd panics there. Add
   a current_pid() helper that returns 0 on wasm.

3. xet_core_structures::xorb_object::compression_scheme used
   std::time::Instant for compression timing telemetry. On the LZ4
   decompression path this panics on wasm. Use web_time::Instant which
   aliases std on native and exposes performance.now() on wasm.

4. ExpWeightedMovingAvg, speed_tracker, and the adaptive_concurrency
   controller used tokio::time::Instant. On wasm32 that falls through to
   std::time::Instant::now() and panics. Use tokio::time::Instant on
   native (so the existing tests' tokio::time::advance simulation still
   works) and web_time::Instant on wasm via cfg.

5. xet_client adaptive_concurrency controller spawned a partial-completion
   reporter task via tokio::spawn, which requires being inside a tokio
   runtime context. Switch to the tokio_with_wasm alias so it uses
   wasm_bindgen_futures::spawn_local on wasm.
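
The current_pid() helper from item 2 could look roughly like the sketch below. The helper name and the "returns 0 on wasm" behaviour come from the commit message; the exact signature and the forked_since function are assumptions added only to show how a fork check might consume it.

```rust
/// Native: defer to the standard library.
#[cfg(not(target_family = "wasm"))]
pub fn current_pid() -> u32 {
    std::process::id()
}

/// wasm32-unknown-unknown has no process model, so report a constant
/// instead of calling std::process::id().
#[cfg(target_family = "wasm")]
pub fn current_pid() -> u32 {
    0
}

/// Hypothetical fork check: compares the pid recorded at construction
/// time with the current one. On wasm both are 0, so it never fires.
pub fn forked_since(recorded_pid: u32) -> bool {
    current_pid() != recorded_pid
}
```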

Verified end-to-end with a puppeteer test that downloads an 11 MB Xet-backed
file from the Qwen/Qwen-Image-Edit repo via the WASM bindings; bytes streamed
match the expected size exactly.
…README

- Delete docs/design/2026-05-12-xet-pkg-wasm-download-{design,plan}.md now
  that the feature is implemented.
- Add a short "WebAssembly compatibility" section to the root README
  listing the patterns this codebase relies on (web_time::Instant,
  tokio_with_wasm::alias, conditional ?Send async-traits, filesystem
  gating) so future contributors and AI agents touching xet_pkg /
  xet_client / xet_data / xet_core_structures / xet_runtime know not to
  regress the wasm build.
- Add a README to wasm/hf_xet_wasm_download describing the JS API, the
  HF Hub endpoints the example calls, and how to do a local browser
  test, plus a maintainer pointer to the root README.
Pin updates across all three wasm crates, build scripts, READMEs, and the
build-wasm GitHub Action. Lockfiles for hf_xet_wasm, hf_xet_thin_wasm, and
hf_xet_wasm_download regenerated; js-sys, web-sys, wasm-bindgen-futures,
and wasm-bindgen-test pick up matching versions.
@assafvayner assafvayner force-pushed the feat/wasm-xet-pkg-compile branch from 97f9ebf to 989e522 Compare May 14, 2026 20:56
Runs the built wasm against prod hub + CAS in headless Chromium and
asserts byte count + SHA-256 of a pinned reference file from
xet-team/xet-spec-reference-files. Skipped on forks (no token);
continue-on-error so a hub blip doesn't block PRs.
The two browser-facing crates are CI smoke wrappers, not published SDKs;
say so consistently across both READMEs, both crate-level `//!` docs,
the root README, and the api_changes doc. Real browser consumers should
depend on `hf-xet` directly with their own `#[wasm_bindgen]` glue.

The `hf_xet_wasm_download` README still documented the pre-builder API
(`new XetSession(endpoint, token, exp)` → `downloadStream(...)`); rewrite
it to match the actual session / group split implemented in the PR.

In `xet_pkg/src/lib.rs`, the top-level doctest uses `_blocking` /
`upload_from_path` / `download_file_to_path` unconditionally — all
non-wasm-only. Add a "WebAssembly targets" section pointing wasm
consumers at the async entry points (`new_upload_commit` +
`upload_bytes`/`upload_stream`, `new_download_stream_group` +
`download_stream`) and at the example crates, and reference the
api_changes doc for the full set of wasm-only differences. Use plain
text rather than intra-doc links for `legacy` and `new_file_download_group`
so wasm-target rustdoc builds don't emit unresolved-link warnings for
items that are cfg'd out on wasm.
Both example crates' `validate_session_inputs` documented `0` as "no
expiry", but the inner `AuthConfig::maybe_new` only treats *missing*
expiry as no-expiry (defaults to `u64::MAX`); an explicit `Some(0)` is
preserved as-is, after which `TokenProvider::is_expired()` always
returns true (`0 <= cur_time + REFRESH_BUFFER_SEC`) and the next
request fails with an auth error because this wrapper does not wire a
token refresher.

Map `0` to `u64::MAX` at the JS boundary so the documented sentinel
actually works for placeholder / local-only flows (the CI upload smoke
uses `0` because it never makes a CAS round-trip). Reword the doc
comments and READMEs to spell out that any *non-zero* value at or
before "now" still fails — the only safe inputs in production are real
`exp` values from the Hub `xet-{read,write}-token` response.
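
To make the failure mode concrete, here is a small sketch of the expiry check and the boundary mapping this commit message describes. The comparison mirrors the expression quoted above; the value of REFRESH_BUFFER_SEC and the function names are assumptions, and a later commit in this PR replaces the mapping with outright rejection of non-positive values.

```rust
/// Assumed value for illustration; the real constant is not shown here.
const REFRESH_BUFFER_SEC: u64 = 30;

/// Mirrors the quoted check: a token counts as expired once its expiry
/// is at or before "now" plus the refresh buffer.
pub fn is_expired(expiry: u64, cur_time: u64) -> bool {
    expiry <= cur_time + REFRESH_BUFFER_SEC
}

/// Hypothetical JS-boundary mapping: treat the documented sentinel 0 as
/// "never expires" and pass every other value through unchanged.
pub fn expiry_from_js(token_expiry: u64) -> u64 {
    if token_expiry == 0 { u64::MAX } else { token_expiry }
}
```

Without the mapping, an explicit 0 is always expired; with it, only non-zero values at or before "now" fail.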
…o hf_xet_wasm

Merge the two example/smoke-test crates into a single wasm/hf_xet_wasm/
crate exposing one XetSession with both newUploadCommit and
newDownloadStreamGroup. The shared helpers (js_err,
validate_session_inputs) now live in src/common.rs instead of being
copy-pasted, and the download files are renamed to download_group.rs /
download_stream.rs for clarity in the combined module tree.

CI (build-wasm action, cache action, Cargo.lock freshness check),
the ci-smoke pages, the api_changes doc, the root README, and the
xet_pkg doc-comment are updated to reference the single crate.
Build script, wasm-bindgen pin, and the JS surface are otherwise
unchanged.
@assafvayner assafvayner marked this pull request as ready for review May 19, 2026 17:57
Comment thread wasm/ci-smoke/server.mjs
Comment thread wasm/hf_xet_wasm/src/common.rs
Drop the patch component from version requirements that weren't using
^ or = explicitly. The lock file resolves a specific patch anyway, so
trimming to major.minor lets the lock regenerate against whatever the
local registry happens to have without breaking the version spec.

Also regenerates wasm/hf_xet_wasm/Cargo.lock against the slightly older
mirror set (ctor 1.0.5, tower-http 0.6.10) so ./build_wasm.sh works on
machines using the internal HF crates mirror.
The wasm SystemMonitor only surfaced what the browser exposes via
`navigator` / `performance` and was never wired up to anything useful,
so remove it rather than maintain a second implementation. On wasm,
`SystemMonitor` is no longer compiled and `XetRuntime` does not hold or
start one. The `system_monitor` config group still compiles on all
targets so config keys round-trip; the runtime simply does not read it
on wasm.
Comment thread xet_client/src/cas_client/remote_client.rs
Comment thread xet_core_structures/src/utils/exp_weighted_moving_avg.rs
Per review feedback: collapse the two cfg-gated `let request = ...`
bindings into one expression so the native and wasm alternatives are
visually paired.
…cache

Wasm previously stubbed `query_dedup_shard_by_chunk` to `Ok(false)`
because the native path imports fetched dedup shards into a disk-backed
`ShardFileManager`, which is incompatible with the
wasm32-unknown-unknown sandbox. The CAS-side query itself was already
wasm-compatible — only the shard-import side was missing.

Now the wasm `SessionShardInterface` keeps a bounded LRU `VecDeque` of
fetched shards (cap 32, ~few MB of metadata), parsed in-process via the
new `MDBInMemoryShard::from_reader` helper. The cache is kept separate
from `session_shard` so server-side shard metadata is never re-uploaded
via `upload_and_register_session_shards`. `chunk_hash_dedup_query`
walks the session shard first, then the LRU; cache hits bump the
matched shard to most-recently-used and report `already_uploaded=true`
so callers skip the xorb upload.
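
For illustration, a bounded LRU over a VecDeque as described in this commit message might be shaped like the sketch below. CachedShard, DedupShardCache, and the u64 chunk hashes are simplified stand-ins invented for the example; the real cache holds parsed MDBInMemoryShard values.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a parsed shard: only the chunk hashes it lists.
pub struct CachedShard {
    pub chunk_hashes: Vec<u64>,
}

/// Bounded cache; the front of the deque is the least recently used shard.
pub struct DedupShardCache {
    cap: usize,
    shards: VecDeque<CachedShard>,
}

impl DedupShardCache {
    pub fn new(cap: usize) -> Self {
        Self { cap, shards: VecDeque::with_capacity(cap) }
    }

    /// Insert a fetched shard, evicting the least recently used when full.
    pub fn insert(&mut self, shard: CachedShard) {
        if self.cap == 0 {
            return;
        }
        if self.shards.len() == self.cap {
            self.shards.pop_front();
        }
        self.shards.push_back(shard);
    }

    /// Returns true on a hit and bumps the matching shard to the
    /// most-recently-used position (the back of the deque).
    pub fn contains_chunk(&mut self, hash: u64) -> bool {
        let Some(pos) = self.shards.iter().position(|s| s.chunk_hashes.contains(&hash)) else {
            return false;
        };
        if let Some(shard) = self.shards.remove(pos) {
            self.shards.push_back(shard);
        }
        true
    }

    pub fn len(&self) -> usize {
        self.shards.len()
    }
}
```

Dropping the bump in contains_chunk turns this into a plain bounded FIFO.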

The unused `#[allow(dead_code)]` on `ctx` is dropped — it's now read in
`query_dedup_shard_by_chunk` for `config.data.default_prefix`.
…dmines

Replace the placeholder upload smoke (tokenExpiry=0 sentinel, no CAS
round-trip) with end-to-end variants that mint a real xet-write-token
from the Hub for xet-team/xet-wasm-test and exercise commit() (xorb +
shard push to CAS). Cover bytes vs stream and single vs multi-file
(3 concurrent uploads per commit) so the per-file aggregation and the
xorb-task fan-out are both on the smoke surface.

Bugs the strengthened smokes surfaced and this commit also fixes:

- xet_data/progress_tracking/upload_tracking.rs:391 used bare
  tokio::spawn, which panics on wasm ("no reactor running"). Cfg-gate
  use tokio_with_wasm::alias as tokio so the same call resolves to
  the wasm-compatible shim, matching the pattern in adjacent files.

- xet_data/processing/file_upload_session::finalize_impl took
  deduplication_metrics before joining xorb_upload_tasks. The spawned
  tasks update session.deduplication_metrics only after their CAS
  request resolves, so a take() before join drops their writes into
  the empty replacement. On native multithreaded the tasks usually
  completed during the prior .await; on wasm (single-threaded) they
  were still in flight, surfacing as xorb_bytes_uploaded == 0 in the
  smoke. Reorder: join first, then take.
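
The ordering bug in finalize_impl can be reproduced in miniature. The sketch below is a deterministic model, not the real code: pending upload tasks are represented as boxed closures that only run when "joined", which approximates a single-threaded wasm executor where spawned tasks have not progressed yet. All type and function names are invented for the example.

```rust
use std::cell::RefCell;
use std::mem::take;
use std::rc::Rc;

#[derive(Default)]
pub struct DedupMetrics {
    pub xorb_bytes_uploaded: u64,
}

/// A pending upload: records its byte count only when it finally runs.
type Task = Box<dyn FnOnce()>;

fn queue_uploads(metrics: &Rc<RefCell<DedupMetrics>>, sizes: &[u64]) -> Vec<Task> {
    sizes
        .iter()
        .map(|&n| {
            let m = Rc::clone(metrics);
            Box::new(move || {
                m.borrow_mut().xorb_bytes_uploaded += n;
            }) as Task
        })
        .collect()
}

/// Old order: snapshot the metrics, then join. Late writes land in the
/// empty replacement value and are lost.
pub fn finalize_take_then_join(sizes: &[u64]) -> u64 {
    let metrics = Rc::new(RefCell::new(DedupMetrics::default()));
    let tasks = queue_uploads(&metrics, sizes);
    let snapshot = take(&mut *metrics.borrow_mut());
    for task in tasks {
        task();
    }
    snapshot.xorb_bytes_uploaded
}

/// Fixed order: join every task first, then snapshot.
pub fn finalize_join_then_take(sizes: &[u64]) -> u64 {
    let metrics = Rc::new(RefCell::new(DedupMetrics::default()));
    let tasks = queue_uploads(&metrics, sizes);
    for task in tasks {
        task();
    }
    let snapshot = take(&mut *metrics.borrow_mut());
    snapshot.xorb_bytes_uploaded
}
```

In this model the old order always reports 0 bytes, matching the symptom seen in the wasm smoke, while the fixed order reports the full total.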

API change in the wasm wrapper:

- validate_session_inputs now rejects tokenExpiry <= 0 instead of
  rewriting 0 to u64::MAX. Callers must pass the real exp from the
  Hub response; the sentinel only existed to keep the placeholder
  smoke alive and is no longer needed.

CI: four new non-blocking steps in build_and_test-wasm covering
upload (bytes/stream) x (single/multi-file), all sourcing
HF_SMOKE_TEST_TOKEN || HF_TOKEN.
Untracks .vscode/settings.json (previously shipped with a default rust
formatter pin) so local editor state no longer leaks into git status.
- xorb_block: skip chunk-cache read/write on wasm. Wasm builds have no
  disk-backed ChunkCache, so the cache_key construction (and the put
  spawn) are unreachable; cfg-gate them out so the import of Key is
  also wasm-only and the chunk_cache arg becomes #[allow(unused)] under
  wasm.

- xet_runtime/utils/mod.rs + file_paths_wasm.rs: extract the wasm
  TemplatedPathBuf stub to its own file. mod.rs now just re-exports
  from file_paths or file_paths_wasm depending on target; cleaner than
  inlining the stub.

- xet_runtime/utils/rw_task_lock: add the same cfg-gated
  `use tokio_with_wasm::alias as tokio;` shim used elsewhere, so the
  tokio::spawn / JoinHandle types resolve correctly on wasm.
Adds focused smoke pages + runners for surfaces the existing smokes
didn't reach. Each runs against xet-team/xet-wasm-test (except
invalid-inputs, which is local-only), sources HF_SMOKE_TEST_TOKEN ||
HF_TOKEN, and is wired into build_and_test-wasm:

- invalid-inputs: validates validate_session_inputs rejects bad token,
  endpoint, and tokenExpiry inputs across both newUploadCommit and
  newDownloadStreamGroup. Blocking (no network) — a regression here
  would silently weaken the validation surface.

- download-multi: two concurrent downloadStream calls from one
  XetDownloadStreamGroup against pytorch_model.bin + tf_model.h5 on
  the pinned tiny-random-bert commit. Size delta (540 KiB vs 26 MiB)
  makes any stream-fan-out buffer crossover unambiguous.

- dedup: 65 MiB payload uploaded twice as two files in one commit.
  The 65 MiB size forces an xorb cut between the two uploadBytes
  calls (MAX_XORB_BYTES is 64 MiB), so the second file's chunks
  dedup against the first xorb's entries in session_shard.
  Smaller payloads don't work: both files end up co-resident in
  current_session_data with the session_shard still empty.

- upload-tiny: 0-byte + 1-byte + 64 KiB files in one commit. Hits
  the empty-xorb suppression path and the no-chunks branch in the
  chunker — common bug nurseries.

- upload-mixed: uploadBytes + uploadStream concurrently on the same
  XetUploadCommit, then commit(). Catches handle-tracking bugs
  where one path's metadata clobbers the other's before
  commit() aggregates them.

- upload-multi-commit: two sequential XetUploadCommits from one
  XetSession. Each commit constructs its own FileUploadSession;
  catches XetSession-level leaks (task runtime cleanup, locked
  state) that would manifest as the second newUploadCommit() hanging.

Note on global dedup: a deterministic test of the CAS-indexed dedup
path needs a chunk hash satisfying `hash % 1024 == 0`, which random
1 MiB payloads only produce ~1.6% of the time. The session-shard
branch of chunk_hash_dedup_query (exercised by the dedup smoke) is
on the same surface and covers that regression class.
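
The ~1.6% figure can be checked with a short calculation. The sketch assumes chunk hashes are uniformly distributed and that a 1 MiB payload splits into about 16 chunks (an average chunk size of roughly 64 KiB, which is an assumption, not something stated in this commit message).

```rust
/// Probability that at least one of `n_chunks` uniformly distributed
/// chunk hashes satisfies `hash % modulus == 0`.
pub fn p_any_eligible(n_chunks: u32, modulus: u32) -> f64 {
    let p_miss = 1.0 - 1.0 / f64::from(modulus);
    1.0 - p_miss.powi(n_chunks as i32)
}
```

Under those assumptions p_any_eligible(16, 1024) evaluates to a little over 1.5%, consistent with the estimate quoted above.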
…drift

- server.mjs path-prefix check now uses a `path.sep` boundary so a sibling
  dir like `<root>-other/` is no longer accepted alongside `<root>/`.
- New `wasm/check-version-pins.sh` asserts that every place pinning
  wasm-bindgen / wasm-pack (workspace Cargo.toml, the two wasm-subcrate
  Cargo.tomls, build_wasm.sh's WASM_BINDGEN_VERSION, and the action's
  cargo install + cache key) agrees on a single version. Wired as the
  first step in the build-wasm action so a partial bump fails CI before
  any install or build runs.
…pile

# Conflicts:
#	wasm/hf_xet_wasm/src/wasm_file_cleaner.rs
#	xet_client/src/cas_client/remote_client.rs
#	xet_data/src/processing/mod.rs

@assafvayner assafvayner left a comment

Contributor Author

claude generated — inline notes documenting the non-gating behavioral changes in this PR (the "genuine logic changes"). These are the only changes that alter runtime behavior beyond cfg-gating; flagging them for reviewer attention.

result??;
}

let mut metrics = take(&mut *self.deduplication_metrics.lock().await);

claude generated — Correctness fix: take(deduplication_metrics) was moved to run after the xorb_upload_tasks join loop above. Each xorb-upload task records its transmitted-byte count into deduplication_metrics only after its CAS request resolves, so taking the metric before the join snapshots an empty DeduplicationMetrics and silently drops those late writes. On native (multi-threaded) the tasks usually finish before this point, masking the race; on single-threaded wasm it reliably surfaced as xorb_bytes_uploaded == 0. This also closes a latent native race — and is safe because no new upload tasks are queued after the join.

  .xorb
  .simulation_max_bytes
- .map(|bs| (bs.as_u64() as usize).min(*MAX_XORB_BYTES))
+ .map(|bs| bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize)

claude generated — Correctness fix (32-bit truncation): the order was swapped from (bs.as_u64() as usize).min(*MAX_XORB_BYTES) to bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize. The old form cast a potentially huge u64 to usize before clamping, which truncates on wasm32's 32-bit usize. Clamping in u64 space first guarantees the result fits (MAX_XORB_BYTES = 64 MiB). Same change applied at file_upload_session.rs:442.
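
The effect of the cast order can be shown on any host by using u32 as a stand-in for wasm32's 32-bit usize. This is an illustrative model only; the function names are invented and the constant is the 64 MiB value stated in the comment above.

```rust
/// 64 MiB, as stated in the review comment.
const MAX_XORB_BYTES: u64 = 64 * 1024 * 1024;

/// Old order: cast to the 32-bit type first, then clamp. A value above
/// u32::MAX wraps before the clamp ever sees it.
pub fn cast_then_clamp(configured: u64) -> u32 {
    (configured as u32).min(MAX_XORB_BYTES as u32)
}

/// New order: clamp in u64 space, then cast. The result always fits.
pub fn clamp_then_cast(configured: u64) -> u32 {
    configured.min(MAX_XORB_BYTES) as u32
}
```

For a configured value of 4 GiB + 1 KiB, the old order yields 1024 bytes while the new order yields the intended 64 MiB cap; for values that already fit in 32 bits the two agree.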

  .xorb
  .simulation_max_bytes
- .map(|bs| (bs.as_u64() as usize).min(*MAX_XORB_BYTES))
+ .map(|bs| bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize)

claude generated — Correctness fix (32-bit truncation): same fix as file_deduplication.rs:99 — clamp simulation_max_bytes against MAX_XORB_BYTES in u64 space before the as usize cast, so wasm32's 32-bit usize cannot truncate a large configured value.

// pass the raw Bytes directly (CONTENT_LENGTH is set by reqwest from the body length).
#[cfg(target_family = "wasm")]
{
client.post(url).with_extension(Api(api_tag)).body(serialized_data)

claude generated — Behavior change: native and wasm upload_xorb are now a single code path. The previous separate wasm impl posted the body directly and skipped both the connection permit and RetryWrapper/adaptive-concurrency. wasm now flows through the same retry + permit logic; it differs only in the request body (raw Bytes, since reqwest's wasm backend can't stream request bodies — CONTENT_LENGTH is set by reqwest from the body) and emits a single bulk report_progress(n_transfer_bytes) after success instead of per-chunk streaming updates. Net improvement, but note the wasm upload path now newly exercises the retry/adaptive-concurrency machinery.

/// hits and `true` for dedup-cache hits, since cache entries come from
/// shards the server already has. Cache hits also bump the matched
/// shard to the most-recently-used position.
pub async fn chunk_hash_dedup_query(

claude generated — Correctness note: the (usize, FileDataSequenceEntry, bool) "already uploaded" contract matches native (shard_interface/native.rs::chunk_hash_dedup_query): session-shard hit → false, global-dedup-cache hit → true. wasm correctly omits native's third "resumed session" branch (no resume support on wasm) and mirrors only the two branches that apply. A miss here costs only a re-upload (never corruption), so the in-memory + bounded-LRU divergence from native's on-disk shard managers is low-risk for correctness.

Add job-level continue-on-error to build_and_test-wasm so a wasm
failure no longer fails the overall CI run or gates merges/releases.
The job still runs the full wasm suite (version-pin check, wasm builds,
the cargo check -p hf-xet compile gate, and all 11 browser smokes) on
every PR and main push; genuine regressions surface as a red "Build
WASM" job (visible signal) while the run stays green. The per-step
continue-on-error on the network smokes keeps the job green for
hub/CAS blips, so a red job means a real wasm regression.
…arness

- dedupe TaskRuntime::bridge_async{,_finalizing} across targets via a
  MaybeSend shim; only run_inner_async and bridge_sync* stay cfg-gated
- collapse the duplicated wasm variants of XetSessionBuilder::build and
  TranslatorConfig::new into single fns with a cfg'd binding
- wasm shard dedup cache: drop LRU recency machinery for a bounded FIFO
- fold system_monitor into all_config_groups! now that its wasm gate is gone
- replace XetConfig::validate_usize_bounds with usize::try_from at the
  ingestion_block_size cast sites
- unify wasm cfg predicate in async_read.rs (target_arch -> target_family)
- consolidate wasm/ci-smoke into a shared harness (run.mjs scenario table +
  harness.html + common.mjs + scenarios/*.mjs), replacing 22 near-duplicate
  runner/page files; ci.yml steps and blocking semantics unchanged
- drop wasm/check-version-pins.sh and its build-wasm action step
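
As an illustration of the "bounded FIFO" item above, the following is a minimal sketch of a cache that evicts in insertion order and does no recency tracking on reads. The type name `BoundedFifo`, the `u64` keys, and the overwrite behavior are assumptions made for this example, not the PR's actual cache.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical bounded FIFO cache: the oldest inserted key is evicted first.
pub struct BoundedFifo<V> {
    capacity: usize,
    order: VecDeque<u64>,
    entries: HashMap<u64, V>,
}

impl<V> BoundedFifo<V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self {
            capacity,
            order: VecDeque::with_capacity(capacity),
            entries: HashMap::with_capacity(capacity),
        }
    }

    /// Inserts a value. Overwriting an existing key keeps its original queue position.
    pub fn insert(&mut self, key: u64, value: V) {
        if self.entries.insert(key, value).is_some() {
            return;
        }
        self.order.push_back(key);
        if self.order.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
    }

    /// Lookup takes `&self`: unlike an LRU, a hit does not reorder anything.
    pub fn get(&self, key: &u64) -> Option<&V> {
        self.entries.get(key)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The trade-off against an LRU is that a frequently hit entry can still be evicted once enough newer entries arrive; in exchange, reads need no mutable access and no bookkeeping.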
Comment thread wasm/hf_xet_wasm/examples/download.html
Comment thread wasm/hf_xet_wasm/src/download_group.rs
Production CAS returns HMAC-keyed shards from query_for_global_dedup_shard:
the chunk hashes in the shard are keyed with a per-shard key stored in the
footer, which MDBInMemoryShard::from_reader skips. The wasm dedup cache was
probing keyed lookup tables with raw chunk hashes, so global dedup silently
never matched against prod (while working in unkeyed local simulation).

Cache entries now carry the shard's MDBShardInfo (header + footer): probe
hashes are keyed with chunk_hmac_key() per shard, mirroring the native
shard_file_manager, and entries past shard_key_expiry are skipped (0 = no
expiry, web_time for the wasm clock).

Adds a native unit test building a keyed shard via
export_as_keyed_shard_streaming and asserting raw probes miss while keyed
probes hit — the only CI-visible guard, since simulation shards are unkeyed.
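
The failure mode described above can be reproduced in miniature. The sketch below is an assumption-heavy toy: `toy_keyed_hash` uses the standard library's `DefaultHasher` as a stand-in and is NOT an HMAC, chunk hashes are shortened to `u64`, and `CachedShard` is a hypothetical type. It only shows why probing a keyed table with raw hashes never matches, and how the "0 = no expiry" rule can be applied.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Toy keyed hash. NOT an HMAC and not secure; it only stands in for the
/// per-shard keying so the lookup logic can be demonstrated.
fn toy_keyed_hash(key: u64, chunk_hash: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    chunk_hash.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical cache entry: the shard's key, its expiry, and its keyed chunk table.
pub struct CachedShard {
    key: u64,
    /// Seconds since the epoch; 0 means the key never expires.
    key_expiry_secs: u64,
    keyed_chunks: HashSet<u64>,
}

impl CachedShard {
    /// Builds the table the way a keyed shard arrives: hashes are already keyed.
    pub fn from_raw_chunks(key: u64, key_expiry_secs: u64, raw: &[u64]) -> Self {
        let keyed_chunks = raw.iter().map(|&h| toy_keyed_hash(key, h)).collect();
        Self { key, key_expiry_secs, keyed_chunks }
    }

    /// The buggy probe: looks up the raw hash in a table of keyed hashes.
    pub fn probe_raw(&self, raw_chunk_hash: u64) -> bool {
        self.keyed_chunks.contains(&raw_chunk_hash)
    }

    /// The fixed probe: skip expired shards, then key the hash before lookup.
    pub fn probe_keyed(&self, raw_chunk_hash: u64, now_secs: u64) -> bool {
        if self.key_expiry_secs != 0 && now_secs > self.key_expiry_secs {
            return false;
        }
        self.keyed_chunks.contains(&toy_keyed_hash(self.key, raw_chunk_hash))
    }
}
```

With an unkeyed shard the raw and keyed tables coincide, which is consistent with the bug staying hidden in local simulation.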
… xorbs)

Downloads the pre-seeded deterministic 16 MiB file from
xet-team/xet-wasm-test and re-uploads its bytes in a fresh commit,
asserting the chunk-0 global-dedup query fully dedups the re-upload:
new_bytes == 0, xorb_bytes_uploaded == 0, shard still pushed. This is the
end-to-end regression guard for the HMAC-keyed shard lookup — if the
keying regresses, deduped_bytes_by_global_dedup drops to 0 and a full
payload of xorbs is pushed.

Deterministic because prod CAS indexes the first chunk of every uploaded
file and the client always queries chunk 0. First run against an unseeded
repo bootstraps: uploads the xorshift32 seed payload and commits it via
the Hub NDJSON commit API so paths-info sees it and GC keeps its xorbs.

Verified against prod: bootstrap + assert runs pass; with the HMAC fix
reverted the scenario fails with the expected signature. Non-blocking CI
step like the other network smokes.
…as-casts

Scope cleanup of 65ab8ab: keep pre-existing main code untouched where the
change had no wasm effect.

- xet_runtime config (groups/mod.rs, macros.rs, xet_config.rs): back to
  main's explicit per-consumer system_monitor appends with their
  cfg(not(wasm)) gates; the group is absent from the config on wasm as on
  main. Also drops the PR's validate_usize_bounds (net-zero vs main).
- data_client.rs, file_upload_session.rs: restore main's `as usize` casts
  of ingestion_block_size in the four cfg(not(wasm))-only fns (clean_file,
  hash_files_async, upload_files, feed_file_to_cleaner) — the try_from
  guard is meaningless on 64-bit-only code. The wasm-compiled site in
  file_cleaner.rs keeps the guard.
- api_changes doc updated to match.
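
For context on why the guard matters only at the wasm-compiled site, here is a small sketch contrasting a checked conversion with an `as` cast. It assumes `ingestion_block_size` is a `u64` (the source does not state its type), and the helper names are hypothetical. The 32-bit helpers use `u32` explicitly to mimic the pointer width of `wasm32`, so the behavior can be observed on a 64-bit host.

```rust
/// Checked conversion to the current target's pointer width.
/// On 64-bit targets this cannot fail for a `u64`; on wasm32 it can.
pub fn block_size_checked(ingestion_block_size: u64) -> Result<usize, String> {
    usize::try_from(ingestion_block_size)
        .map_err(|_| format!("ingestion_block_size {ingestion_block_size} does not fit in usize"))
}

/// The same check at an explicit 32-bit width, mimicking wasm32's `usize`.
pub fn fits_wasm32(ingestion_block_size: u64) -> bool {
    u32::try_from(ingestion_block_size).is_ok()
}

/// What an `as` cast does at 32-bit width: it silently keeps the low 32 bits.
pub fn truncating_cast_32(ingestion_block_size: u64) -> u32 {
    ingestion_block_size as u32
}
```

A value such as 5,000,000,000 passes through `as` as 705,032,704 at 32-bit width, whereas the checked form reports an error instead of continuing with a wrong size.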
…pload lifecycle, sha256 policy)

- download-range: prefix/mid/suffix byte-range downloads vs reference slices
- download-cancel: cancel() mid-stream, group must stay usable
- download-error: nonexistent hash + malformed fileInfo must reject, not hang
- upload-lifecycle: post-abort/post-commit misuse must reject (wasm mirror of
  native upload_commit state-machine tests)
- sha256-policy: compute/provided/skip metadata + parse_sha256_policy rejects
- download-multi: verify per-file content sha256 against pinned values

Also make smoke failures visible in CI: drop step-level continue-on-error
(the job-level flag still keeps the overall run green and gates nothing),
and absorb hub/CAS blips with a single in-runner retry in run.mjs instead.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 585d975.

Comment thread wasm/ci-smoke/run.mjs
`dedup_metrics.xorb_bytes_uploaded=${dedup.xorb_bytes_uploaded}, expected < 1.5 * ${EXPECTED_PAYLOAD_SIZE} ` +
`(only one payload's worth of bytes should hit CAS): ${JSON.stringify(dedup)}`,
);
}

Dedup smoke misses zero xorb

Low Severity

The dedup smoke only upper-bounds commitReport.dedup_metrics.xorb_bytes_uploaded and never requires a positive upload count. When xorb_bytes_uploaded is zero (e.g. metrics taken before xorb upload tasks finish), the check still passes, so the scenario can miss the regression that the table comments are meant to guard against.

@seanses seanses left a comment

Collaborator

first batch of comments

~/.cargo/bin/wasm-pack
~/.cargo/.crates.toml
~/.cargo/.crates2.json
key: ${{ runner.os }}-${{ runner.arch }}-cargo-tools-wbg-0.2.121-wp-0.14.0-rustc-${{ steps.rustc-version.outputs.version }}

Collaborator

"wasm-bindgen-cli" and "wasm-pack" are stable native host binaries that don't depend on nightly-only features. So embedding the rustc nightly version in the cache key is useless: compilation produces the same artifact as long as the binary release version is identical (e.g. 0.2.121).

- name: Check hf-xet compiles for wasm32-unknown-unknown
shell: bash
run: |
CARGO_TARGET_WASM32_UNKNOWN_UNKNOWN_RUSTFLAGS="-C target-feature=+atomics,+bulk-memory,+mutable-globals --cfg getrandom_backend=\"wasm_js\"" \

Collaborator

Let's similarly put the build details in a "build_wasm.sh" under the "hf-xet" package (xet_pkg). Also, compared to "wasm/hf_xet_wasm/build_wasm.sh", a lot of RUSTFLAGS seem to be missing.

hf_xet/target
wasm/hf_xet_wasm/target
wasm/hf_xet_thin_wasm/target
wasm/hf_xet_wasm/target

Collaborator

Seems like a meaningless reorder, but alright.

Comment thread .github/workflows/ci.yml
Comment on lines +186 to +195
# Non-blocking: this job runs the full wasm test suite (version-pin check,
# wasm builds, the `cargo check -p hf-xet` compile gate, and all browser
# smoke tests) on every PR and main push, but a failure here must never
# fail the CI run or gate merges/releases. wasm is an additive target, so
# regressions surface as a red "Build WASM" job (visible signal) without
# turning the overall run red. Nothing `needs:` this job and the release
# workflows are independent, so it blocks no other jobs or releases either.
# (The per-step `continue-on-error` on the network smokes below keeps the
# job green for hub/CAS blips, so a red job means a real wasm regression.)
continue-on-error: true

Collaborator

This contradicts our expectation: if we ship a WASM-compatible hf-xet, we don't want to break it in any version.

Comment thread .github/workflows/ci.yml
test -z "$(git status --porcelain Cargo.lock)" || (echo "hf_xet_wasm Cargo.lock has uncommitted changes!" && exit 1)
- uses: actions/setup-node@a0853c24544627f65ddf259abe73b1d18a591444 # v5.0.0
with:
node-version: '20'

Collaborator

Use "24". GitHub Actions is phasing out "20", so we should too.

Comment on lines +30 to +31
The previous separate `wasm/hf_xet_wasm_download` and `wasm/hf_xet_wasm_upload`
crates have been combined back into `wasm/hf_xet_wasm`.

Collaborator

This looks like it describes intermediate work in this PR; ask the AI to trim such statements.

files from disk, and `hash_files_async` additionally routes through
`XetRuntime::spawn_blocking` (which on wasm runs `f` inline anyway —
unsuitable for the parallel-hash use case). Wasm consumers that need
hashing must drive it from JS (e.g. `crypto.subtle.digest`) or feed

Collaborator

What is crypto.subtle.digest? That's not the Xet file hash.


[lib]
crate-type = ["cdylib", "rlib"]
crate-type = ["cdylib"]

Collaborator

What's the reasoning behind removing "rlib"?


#[cfg(not(target_family = "wasm"))]
mod file_paths;
#[cfg(target_family = "wasm")]

Collaborator

Instead of this, I'm thinking of completely cfg-gating the XetConfig values that are of type "TemplatedPathBuf", so we don't need to provide a dummy "TemplatedPathBuf" for wasm that is never used.
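
One possible shape of this suggestion, as a minimal sketch: the path-typed field is compiled out on wasm instead of being backed by a dummy type. `ExampleCacheConfig`, its fields, and `describe` are hypothetical names, and `PathBuf` stands in for `TemplatedPathBuf`; none of this is taken from the actual `XetConfig`.

```rust
#[cfg(not(target_family = "wasm"))]
use std::path::PathBuf;

/// Hypothetical config group. The path field only exists on non-wasm targets.
#[derive(Debug, Clone)]
pub struct ExampleCacheConfig {
    pub max_entries: usize,
    #[cfg(not(target_family = "wasm"))]
    pub cache_dir: PathBuf,
}

impl Default for ExampleCacheConfig {
    fn default() -> Self {
        Self {
            max_entries: 1024,
            // The initializer is gated with the same predicate as the field.
            #[cfg(not(target_family = "wasm"))]
            cache_dir: PathBuf::from("cache"),
        }
    }
}

/// Native variant: may read the path field.
#[cfg(not(target_family = "wasm"))]
pub fn describe(cfg: &ExampleCacheConfig) -> String {
    format!("{} entries, on disk at {}", cfg.max_entries, cfg.cache_dir.display())
}

/// Wasm variant: the path field does not exist, so it cannot be referenced.
#[cfg(target_family = "wasm")]
pub fn describe(cfg: &ExampleCacheConfig) -> String {
    format!("{} entries, in memory only", cfg.max_entries)
}
```

The cost of this approach is that every consumer of the gated field needs a matching `cfg`, which the compiler enforces by rejecting any ungated access when building for wasm.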

Comment thread xet_runtime/Cargo.toml
] }
tokio_with_wasm = { workspace = true, features = ["rt", "sync", "time"] }
web-time = { workspace = true }
wasm-bindgen = { workspace = true }

Collaborator

Does "xet_runtime" need "wasm-bindgen", "wasm-bindgen-futures" and "js-sys"? I thought they are used to generate JS bindings, i.e. in "wasm/*"

@seanses seanses left a comment

Collaborator

Regarding XetRuntime

.enabled
.then(|| {
SystemMonitor::follow_process(config.system_monitor.sample_interval, config.system_monitor.log_path.clone())
.ok()

Collaborator

If we want to inspect the error, we can just add .inspect_err(|e| debug!(...)) before .ok(), instead of rewriting the function.
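
A minimal sketch of the pattern suggested here, using only the standard library. `parse_interval_ms` and `monitor_interval` are hypothetical stand-ins for `SystemMonitor::follow_process` and its call site, and a `Vec<String>` replaces the `debug!` macro so the example has no external dependencies. `Result::inspect_err` requires Rust 1.76 or newer.

```rust
use std::num::ParseIntError;

/// Hypothetical fallible setup step, standing in for `follow_process`.
fn parse_interval_ms(raw: &str) -> Result<u64, ParseIntError> {
    raw.trim().parse::<u64>()
}

/// Returns the interval when monitoring is enabled and setup succeeds.
/// Errors are recorded via `inspect_err` and then discarded by `ok()`.
pub fn monitor_interval(enabled: bool, raw: &str, log: &mut Vec<String>) -> Option<u64> {
    enabled
        .then(|| {
            parse_interval_ms(raw)
                // Observe the error without changing the Result (stand-in for `debug!`).
                .inspect_err(|e| log.push(format!("system monitor disabled: {e}")))
                .ok()
        })
        .flatten()
}
```

Because `inspect_err` passes the `Result` through unchanged, the surrounding `.then(...)` chain keeps its original shape and only gains one line.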

@@ -3,15 +3,18 @@ use std::collections::HashMap;
use std::fmt::Display;
use std::future::Future;
use std::panic::AssertUnwindSafe;

Collaborator

This file contains so much cfg-gated code that it's becoming difficult to read, and it still contains many code paths that WASM won't use at all (e.g. all member fields of XetRuntime). A XetRuntime for the WASM target can be extremely simple. How about splitting the WASM XetRuntime into a separate file and keeping this one for the native target only: #873

Development

Successfully merging this pull request may close these issues.

hf-xet rust crate (xet_pkg) should support WASM target

3 participants