feat(wasm): make hf-xet compile for wasm32-unknown-unknown #841
assafvayner wants to merge 79 commits into
Conversation
Gate imports and items that depend on non-WASM modules (xet_data::processing, xet_data::file_reconstruction, xet_runtime::core::xet_cache_root) behind #[cfg(not(target_family = "wasm"))]:
- error.rs: gate FileReconstructionError import, from_file_reconstruction_error_ref, DataError::FileReconstructionError match arm, and From<FileReconstructionError> impl
- lib.rs: gate init_logging (uses xet_cache_root which is non-WASM)
- legacy/mod.rs: gate data_client, progress_tracking mods and all re-exports from xet_data::processing
- xet_session/mod.rs: gate common, download_stream_group, download_stream_handle mods and their re-exports, and xet_data::processing re-exports
- xet_session/session.rs: gate download_stream_group imports, active_download_stream_groups field, new_download_stream_group, register_download_stream_group, and abort's stream group cleanup
…xt, cache-write on non-WASM
…leDownloadSession
…e ClosureGuard on WASM
The example now hardcodes a Qwen/Qwen-Image-Edit file (overridable in the form)
and uses two HF Hub REST endpoints to derive the XetSession inputs:
- POST /api/{repo_type}s/{namespace}/{repo}/paths-info/{rev} -> xetHash + size
- GET /api/{repo_type}s/{namespace}/{repo}/xet-read-token/{rev} -> { accessToken, exp, casUrl }
These are the documented Xet protocol endpoints from
https://huggingface.co/docs/xet/file-id and /docs/xet/auth. paths-info is used
instead of the /resolve route because resolve returns a 302 we MUST NOT follow
and the X-Xet-Hash header is hard to read in a browser fetch on a redirect.
Headless browser test (puppeteer) of wasm/hf_xet_wasm_download against a real HF CAS endpoint surfaced five runtime panics; all are fixed here.
1. XetSessionBuilder::build on wasm called tokio::runtime::Handle::current, which panics when invoked from a JS callback with no enclosing runtime. Switch to XetContext::with_config; the wasm path of XetRuntime::new now constructs a stub XetRuntime (the wasm bridge variants .await directly via wasm_bindgen_futures, so the inner tokio runtime is never driven).
2. xet_runtime::core::runtime called std::process::id() at multiple sites for fork detection; wasm32-unknown-unknown's libstd panics there. Add a current_pid() helper that returns 0 on wasm.
3. xet_core_structures::xorb_object::compression_scheme used std::time::Instant for compression timing telemetry. On the LZ4 decompression path this panics on wasm. Use web_time::Instant, which aliases std on native and exposes performance.now() on wasm.
4. ExpWeightedMovingAvg, speed_tracker, and the adaptive_concurrency controller used tokio::time::Instant. On wasm32 that falls through to std::time::Instant::now() and panics. Use tokio::time::Instant on native (so the existing tests' tokio::time::advance simulation still works) and web_time::Instant on wasm via cfg.
5. xet_client adaptive_concurrency controller spawned a partial-completion reporter task via tokio::spawn, which requires being inside a tokio runtime context. Switch to the wasmtokio alias so it uses wasm_bindgen_futures::spawn_local on wasm.
Verified end-to-end with a puppeteer test that downloads an 11 MB Xet-backed file from the Qwen/Qwen-Image-Edit repo via the WASM bindings; bytes streamed match the expected size exactly.
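The `current_pid()` helper from item 2 might look roughly like the sketch below. The function bodies and the `forked_since` helper are assumptions for illustration; only the name `current_pid` and the "returns 0 on wasm" behavior come from the commit message.

```rust
/// Sketch of a pid helper that never touches `std::process::id()` on wasm.
/// On native targets it simply forwards to the standard library.
#[cfg(not(target_family = "wasm"))]
pub fn current_pid() -> u32 {
    std::process::id()
}

/// On wasm there is no process model, so a fixed value of 0 is returned
/// instead of calling into libstd (which would panic).
#[cfg(target_family = "wasm")]
pub fn current_pid() -> u32 {
    0
}

/// Hypothetical fork check: true when the pid recorded at startup no longer
/// matches the current one. On wasm this is always false for a recorded 0.
pub fn forked_since(recorded_pid: u32) -> bool {
    current_pid() != recorded_pid
}
```

Because both variants share one signature, call sites that do fork detection need no cfg attributes of their own.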
…README
- Delete docs/design/2026-05-12-xet-pkg-wasm-download-{design,plan}.md now
that the feature is implemented.
- Add a short "WebAssembly compatibility" section to the root README
listing the patterns this codebase relies on (web_time::Instant,
tokio_with_wasm::alias, conditional ?Send async-traits, filesystem
gating) so future contributors and AI agents touching xet_pkg /
xet_client / xet_data / xet_core_structures / xet_runtime know not to
regress the wasm build.
- Add a README to wasm/hf_xet_wasm_download describing the JS API, the
HF Hub endpoints the example calls, and how to do a local browser
test, plus a maintainer pointer to the root README.
Pin updates across all three wasm crates, build scripts, READMEs, and the build-wasm GitHub Action. Lockfiles for hf_xet_wasm, hf_xet_thin_wasm, and hf_xet_wasm_download regenerated; js-sys, web-sys, wasm-bindgen-futures, and wasm-bindgen-test pick up matching versions.
97f9ebf to
989e522
Runs the built wasm against prod hub + CAS in headless Chromium and asserts byte count + SHA-256 of a pinned reference file from xet-team/xet-spec-reference-files. Skipped on forks (no token); continue-on-error so a hub blip doesn't block PRs.
The two browser-facing crates are CI smoke wrappers, not published SDKs; say so consistently across both READMEs, both crate-level `//!` docs, the root README, and the api_changes doc. Real browser consumers should depend on `hf-xet` directly with their own `#[wasm_bindgen]` glue.
The `hf_xet_wasm_download` README still documented the pre-builder API (`new XetSession(endpoint, token, exp)` → `downloadStream(...)`); rewrite it to match the actual session / group split implemented in the PR.
In `xet_pkg/src/lib.rs`, the top-level doctest uses `_blocking` / `upload_from_path` / `download_file_to_path` unconditionally, all of which are non-wasm-only. Add a "WebAssembly targets" section pointing wasm consumers at the async entry points (`new_upload_commit` + `upload_bytes`/`upload_stream`, `new_download_stream_group` + `download_stream`) and at the example crates, and reference the api_changes doc for the full set of wasm-only differences. Use plain text rather than intra-doc links for `legacy` and `new_file_download_group` so wasm-target rustdoc builds don't emit unresolved-link warnings for items that are cfg'd out on wasm.
Both example crates' `validate_session_inputs` documented `0` as "no
expiry", but the inner `AuthConfig::maybe_new` only treats *missing*
expiry as no-expiry (defaults to `u64::MAX`); an explicit `Some(0)` is
preserved as-is, after which `TokenProvider::is_expired()` always
returns true (`0 <= cur_time + REFRESH_BUFFER_SEC`) and the next
request fails with an auth error because this wrapper does not wire a
token refresher.
Map `0` to `u64::MAX` at the JS boundary so the documented sentinel
actually works for placeholder / local-only flows (the CI upload smoke
uses `0` because it never makes a CAS round-trip). Reword the doc
comments and READMEs to spell out that any *non-zero* value at or
before "now" still fails — the only safe inputs in production are real
`exp` values from the Hub `xet-{read,write}-token` response.
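A minimal sketch of the expiry arithmetic described in this commit message. The value of `REFRESH_BUFFER_SEC` and the names `is_expired` / `normalize_expiry` are assumptions for illustration; the source only quotes the comparison `0 <= cur_time + REFRESH_BUFFER_SEC` and the `0` to `u64::MAX` mapping.

```rust
/// Assumed buffer; the real constant's value is not shown in the source.
const REFRESH_BUFFER_SEC: u64 = 30;

/// Mirrors the quoted check: a token counts as expired once its expiry is
/// at or before "now plus the refresh buffer".
fn is_expired(expiry: u64, now: u64) -> bool {
    expiry <= now.saturating_add(REFRESH_BUFFER_SEC)
}

/// Boundary mapping from the commit: the `0` sentinel becomes "never
/// expires"; every other value is passed through untouched.
fn normalize_expiry(token_expiry: u64) -> u64 {
    if token_expiry == 0 {
        u64::MAX
    } else {
        token_expiry
    }
}
```

With this mapping a raw `0` is always expired, a normalized `0` never is, and any non-zero value at or before "now" still fails, which is the behavior the reworded docs describe.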
…o hf_xet_wasm
Merge the two example/smoke-test crates into a single wasm/hf_xet_wasm/ crate exposing one XetSession with both newUploadCommit and newDownloadStreamGroup. The shared helpers (js_err, validate_session_inputs) now live in src/common.rs instead of being copy-pasted, and the download files are renamed to download_group.rs / download_stream.rs for clarity in the combined module tree.
CI (build-wasm action, cache action, Cargo.lock freshness check), the ci-smoke pages, the api_changes doc, the root README, and the xet_pkg doc-comment are updated to reference the single crate. Build script, wasm-bindgen pin, and the JS surface are otherwise unchanged.
Drop the patch component from version requirements that weren't using ^ or = explicitly. The lock file resolves a specific patch anyway, so trimming to major.minor lets the lock regenerate against whatever the local registry happens to have without breaking the version spec. Also regenerates wasm/hf_xet_wasm/Cargo.lock against the slightly older mirror set (ctor 1.0.5, tower-http 0.6.10) so ./build_wasm.sh works on machines using the internal HF crates mirror.
The wasm SystemMonitor only surfaced what the browser exposes via `navigator` / `performance` and was never wired up to anything useful, so remove it rather than maintain a second implementation. On wasm, `SystemMonitor` is no longer compiled and `XetRuntime` does not hold or start one. The `system_monitor` config group still compiles on all targets so config keys round-trip; the runtime simply does not read it on wasm.
Per review feedback: collapse the two cfg-gated `let request = ...` bindings into one expression so the native and wasm alternatives are visually paired.
…cache
Wasm previously stubbed `query_dedup_shard_by_chunk` to `Ok(false)` because the native path imports fetched dedup shards into a disk-backed `ShardFileManager`, which is incompatible with the wasm32-unknown-unknown sandbox. The CAS-side query itself was already wasm-compatible; only the shard-import side was missing.
Now the wasm `SessionShardInterface` keeps a bounded LRU `VecDeque` of fetched shards (cap 32, ~few MB of metadata), parsed in-process via the new `MDBInMemoryShard::from_reader` helper. The cache is kept separate from `session_shard` so server-side shard metadata is never re-uploaded via `upload_and_register_session_shards`. `chunk_hash_dedup_query` walks the session shard first, then the LRU; cache hits bump the matched shard to most-recently-used and report `already_uploaded=true` so callers skip the xorb upload.
The unused `#[allow(dead_code)]` on `ctx` is dropped because it is now read in `query_dedup_shard_by_chunk` for `config.data.default_prefix`.
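The bounded-LRU lookup described above can be sketched as follows. This is only an illustration under stated assumptions: `CachedShard`, `DedupCache`, and `dedup_query` are hypothetical names, chunk hashes are reduced to `u64`, and the real code works on `MDBInMemoryShard` values rather than hash sets. Only the cap of 32, the `VecDeque`, the session-first lookup order, and the `already_uploaded` flag come from the commit message.

```rust
use std::collections::{HashSet, VecDeque};

/// Cap taken from the commit message.
const CACHE_CAP: usize = 32;

/// Hypothetical stand-in for a parsed shard: just its chunk hashes.
struct CachedShard {
    chunks: HashSet<u64>,
}

/// Front of the deque is least recently used, back is most recently used.
struct DedupCache {
    shards: VecDeque<CachedShard>,
}

impl DedupCache {
    fn new() -> Self {
        Self { shards: VecDeque::new() }
    }

    /// Insert a fetched shard, evicting the least recently used one if full.
    fn insert(&mut self, shard: CachedShard) {
        if self.shards.len() == CACHE_CAP {
            self.shards.pop_front();
        }
        self.shards.push_back(shard);
    }

    /// Report whether any cached shard holds `hash`; a hit moves the
    /// matching shard to the most-recently-used position.
    fn contains_chunk(&mut self, hash: u64) -> bool {
        let Some(idx) = self.shards.iter().position(|s| s.chunks.contains(&hash)) else {
            return false;
        };
        if let Some(shard) = self.shards.remove(idx) {
            self.shards.push_back(shard);
        }
        true
    }
}

/// Session shard first, then the cache. Returns `Some(already_uploaded)` on
/// a hit: `false` for session hits, `true` for cache hits, `None` on a miss.
fn dedup_query(session: &HashSet<u64>, cache: &mut DedupCache, hash: u64) -> Option<bool> {
    if session.contains(&hash) {
        return Some(false);
    }
    if cache.contains_chunk(hash) {
        return Some(true);
    }
    None
}
```

Keeping the cache separate from the session set is what prevents server-side metadata from being treated as new session data.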
…dmines
Replace the placeholder upload smoke (tokenExpiry=0 sentinel, no CAS
round-trip) with end-to-end variants that mint a real xet-write-token
from the Hub for xet-team/xet-wasm-test and exercise commit() (xorb +
shard push to CAS). Cover bytes vs stream and single vs multi-file
(3 concurrent uploads per commit) so the per-file aggregation and the
xorb-task fan-out are both on the smoke surface.
Bugs the strengthened smokes surfaced and this commit also fixes:
- xet_data/progress_tracking/upload_tracking.rs:391 used bare
tokio::spawn, which panics on wasm ("no reactor running"). Cfg-gate
use tokio_with_wasm::alias as tokio so the same call resolves to
the wasm-compatible shim, matching the pattern in adjacent files.
- xet_data/processing/file_upload_session::finalize_impl took
deduplication_metrics before joining xorb_upload_tasks. The spawned
tasks update session.deduplication_metrics only after their CAS
request resolves, so a take() before join drops their writes into
the empty replacement. On native multithreaded the tasks usually
completed during the prior .await; on wasm (single-threaded) they
were still in flight, surfacing as xorb_bytes_uploaded == 0 in the
smoke. Reorder: join first, then take.
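The take-before-join ordering bug can be reproduced in miniature. The sketch below uses std threads and a std `Mutex` instead of the tokio tasks and async mutex in the real code, and `Metrics`, `spawn_uploads`, and both `finalize_*` functions are hypothetical names; it only illustrates why the snapshot has to happen after the join.

```rust
use std::mem::take;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

#[derive(Default, Debug)]
struct Metrics {
    xorb_bytes_uploaded: u64,
}

/// Spawn three fake uploads; each records its byte count only after its
/// simulated CAS request (the sleep) has resolved.
fn spawn_uploads(metrics: &Arc<Mutex<Metrics>>) -> Vec<JoinHandle<()>> {
    (0..3)
        .map(|_| {
            let m = Arc::clone(metrics);
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(50));
                m.lock().unwrap().xorb_bytes_uploaded += 100;
            })
        })
        .collect()
}

/// Fixed order: join every task, then take the metrics.
fn finalize_join_then_take() -> u64 {
    let metrics = Arc::new(Mutex::new(Metrics::default()));
    for task in spawn_uploads(&metrics) {
        task.join().unwrap();
    }
    let total = take(&mut *metrics.lock().unwrap()).xorb_bytes_uploaded;
    total
}

/// Buggy order: the snapshot is taken while tasks are still in flight, so
/// their later writes land in the empty replacement and are lost.
fn finalize_take_then_join() -> u64 {
    let metrics = Arc::new(Mutex::new(Metrics::default()));
    let tasks = spawn_uploads(&metrics);
    let snapshot = take(&mut *metrics.lock().unwrap()).xorb_bytes_uploaded;
    for task in tasks {
        task.join().unwrap();
    }
    snapshot
}
```

The join-first variant always reports all 300 bytes. The take-first variant depends on timing and will typically report fewer, which matches the "usually fine on native, reliably wrong on wasm" description.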
API change in the wasm wrapper:
- validate_session_inputs now rejects tokenExpiry <= 0 instead of
rewriting 0 to u64::MAX. Callers must pass the real exp from the
Hub response; the sentinel only existed to keep the placeholder
smoke alive and is no longer needed.
CI: four new non-blocking steps in build_and_test-wasm covering
upload (bytes/stream) x (single/multi-file), all sourcing
HF_SMOKE_TEST_TOKEN || HF_TOKEN.
Untracks .vscode/settings.json (previously shipped with a default rust formatter pin) so local editor state no longer leaks into git status.
- xorb_block: skip chunk-cache read/write on wasm. Wasm builds have no disk-backed ChunkCache, so the cache_key construction (and the put spawn) are unreachable; cfg-gate them out so the import of Key is also wasm-only and the chunk_cache arg becomes #[allow(unused)] under wasm.
- xet_runtime/utils/mod.rs + file_paths_wasm.rs: extract the wasm TemplatedPathBuf stub to its own file. mod.rs now just re-exports from file_paths or file_paths_wasm depending on target; cleaner than inlining the stub.
- xet_runtime/utils/rw_task_lock: add the same cfg-gated `use tokio_with_wasm::alias as tokio;` shim used elsewhere, so the tokio::spawn / JoinHandle types resolve correctly on wasm.
Adds focused smoke pages + runners for surfaces the existing smokes didn't reach. Each runs against xet-team/xet-wasm-test (except invalid-inputs, which is local-only), sources HF_SMOKE_TEST_TOKEN || HF_TOKEN, and is wired into build_and_test-wasm:
- invalid-inputs: validates validate_session_inputs rejects bad token, endpoint, and tokenExpiry inputs across both newUploadCommit and newDownloadStreamGroup. Blocking (no network): a regression here would silently weaken the validation surface.
- download-multi: two concurrent downloadStream calls from one XetDownloadStreamGroup against pytorch_model.bin + tf_model.h5 on the pinned tiny-random-bert commit. Size delta (540 KiB vs 26 MiB) makes any stream-fan-out buffer crossover unambiguous.
- dedup: 65 MiB payload uploaded twice as two files in one commit. The 65 MiB size forces an xorb cut between the two uploadBytes calls (MAX_XORB_BYTES is 64 MiB), so the second file's chunks dedup against the first xorb's entries in session_shard. Smaller payloads don't work: both files end up co-resident in current_session_data with the session_shard still empty.
- upload-tiny: 0-byte + 1-byte + 64 KiB files in one commit. Hits the empty-xorb suppression path and the no-chunks branch in the chunker, both common bug nurseries.
- upload-mixed: uploadBytes + uploadStream concurrently on the same XetUploadCommit, then commit(). Catches handle-tracking bugs where one path's metadata clobbers the other's before commit() aggregates them.
- upload-multi-commit: two sequential XetUploadCommits from one XetSession. Each commit constructs its own FileUploadSession; catches XetSession-level leaks (task runtime cleanup, locked state) that would manifest as the second newUploadCommit() hanging.
Note on global dedup: a deterministic test of the CAS-indexed dedup path needs a chunk hash satisfying `hash % 1024 == 0`, which random 1 MiB payloads only produce ~1.6% of the time. The session-shard branch of chunk_hash_dedup_query (exercised by the dedup smoke) is on the same surface and covers that regression class.
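The "~1.6%" figure can be sanity-checked with a short calculation. This sketch assumes an average chunk size of 64 KiB (so a 1 MiB payload yields about 16 chunks) and uniformly distributed hashes; neither assumption is stated in the commit message, and `p_any_eligible` is a hypothetical helper.

```rust
/// Probability that at least one of `n_chunks` uniformly random hashes
/// satisfies `hash % modulus == 0`.
fn p_any_eligible(n_chunks: u32, modulus: u32) -> f64 {
    let p_miss_one = 1.0 - 1.0 / f64::from(modulus);
    1.0 - p_miss_one.powi(n_chunks as i32)
}

/// Number of chunks in a payload, assuming a fixed average chunk size.
fn approx_chunks(payload_bytes: u64, avg_chunk_bytes: u64) -> u32 {
    (payload_bytes / avg_chunk_bytes) as u32
}
```

With `approx_chunks(1 << 20, 64 << 10)` giving 16 chunks, `p_any_eligible(16, 1024)` comes out at roughly 0.0155, in line with the quoted figure and far too low for a deterministic smoke.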
…drift
- server.mjs path-prefix check now uses a `path.sep` boundary so a sibling dir like `<root>-other/` is no longer accepted alongside `<root>/`.
- New `wasm/check-version-pins.sh` asserts that every place pinning wasm-bindgen / wasm-pack (workspace Cargo.toml, the two wasm-subcrate Cargo.tomls, build_wasm.sh's WASM_BINDGEN_VERSION, and the action's cargo install + cache key) agrees on a single version. Wired as the first step in the build-wasm action so a partial bump fails CI before any install or build runs.
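The sibling-directory problem fixed in server.mjs is a general one, and the same idea can be shown in Rust (the actual fix is JavaScript and is not reproduced here). Both function names below are hypothetical. The sketch relies on `Path::starts_with` comparing whole path components, whereas a plain string prefix check does not.

```rust
use std::path::Path;

/// Naive check: a raw string prefix also matches sibling directories such
/// as `<root>-other/`.
fn naive_is_inside(root: &str, candidate: &str) -> bool {
    candidate.starts_with(root)
}

/// Component-aware check: `Path::starts_with` only matches whole path
/// components, so `<root>-other/` is rejected. Note that it does not
/// resolve `..` segments; callers still need to normalize paths first.
fn is_inside(root: &Path, candidate: &Path) -> bool {
    candidate.starts_with(root)
}
```

For a root of `/srv/root`, the naive check accepts `/srv/root-other/index.html` while the component-aware check rejects it and still accepts `/srv/root/pkg/a.wasm`.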
…pile
# Conflicts:
# wasm/hf_xet_wasm/src/wasm_file_cleaner.rs
# xet_client/src/cas_client/remote_client.rs
# xet_data/src/processing/mod.rs
assafvayner left a comment
claude generated — inline notes documenting the non-gating behavioral changes in this PR (the "genuine logic changes"). These are the only changes that alter runtime behavior beyond cfg-gating; flagging them for reviewer attention.
    result??;
}

let mut metrics = take(&mut *self.deduplication_metrics.lock().await);
claude generated — Correctness fix: take(deduplication_metrics) was moved to run after the xorb_upload_tasks join loop above. Each xorb-upload task records its transmitted-byte count into deduplication_metrics only after its CAS request resolves, so taking the metric before the join snapshots an empty DeduplicationMetrics and silently drops those late writes. On native (multi-threaded) the tasks usually finish before this point, masking the race; on single-threaded wasm it reliably surfaced as xorb_bytes_uploaded == 0. This also closes a latent native race — and is safe because no new upload tasks are queued after the join.
.xorb
.simulation_max_bytes
- .map(|bs| (bs.as_u64() as usize).min(*MAX_XORB_BYTES))
+ .map(|bs| bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize)
claude generated — Correctness fix (32-bit truncation): the order was swapped from (bs.as_u64() as usize).min(*MAX_XORB_BYTES) to bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize. The old form cast a potentially huge u64 to usize before clamping, which truncates on wasm32's 32-bit usize. Clamping in u64 space first guarantees the result fits (MAX_XORB_BYTES = 64 MiB). Same change applied at file_upload_session.rs:442.
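The truncation is easy to reproduce on a 64-bit host by using `u32` as a stand-in for wasm32's 32-bit `usize`. The function names are hypothetical and the constant is redefined locally at the 64 MiB value quoted above; this is a sketch of the ordering issue, not the project's code.

```rust
/// 64 MiB, as quoted in the review comment; `u32` models wasm32's `usize`.
const MAX_XORB_BYTES: u32 = 64 * 1024 * 1024;

/// Old order: cast first, clamp second. The cast silently drops the high
/// 32 bits, so a huge configured value can collapse to a tiny one.
fn clamp_after_cast(configured: u64) -> u32 {
    (configured as u32).min(MAX_XORB_BYTES)
}

/// New order: clamp in u64 space, then cast. The clamped value is at most
/// 64 MiB, so the narrowing cast can no longer lose information.
fn clamp_before_cast(configured: u64) -> u32 {
    configured.min(u64::from(MAX_XORB_BYTES)) as u32
}
```

For a configured value of `(1 << 32) + 5`, the old order yields 5 bytes while the new order yields the intended 64 MiB cap; small values are unaffected by either order.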
.xorb
.simulation_max_bytes
- .map(|bs| (bs.as_u64() as usize).min(*MAX_XORB_BYTES))
+ .map(|bs| bs.as_u64().min(*MAX_XORB_BYTES as u64) as usize)
claude generated — Correctness fix (32-bit truncation): same fix as file_deduplication.rs:99 — clamp simulation_max_bytes against MAX_XORB_BYTES in u64 space before the as usize cast, so wasm32's 32-bit usize cannot truncate a large configured value.
// pass the raw Bytes directly (CONTENT_LENGTH is set by reqwest from the body length).
#[cfg(target_family = "wasm")]
{
    client.post(url).with_extension(Api(api_tag)).body(serialized_data)
claude generated — Behavior change: native and wasm upload_xorb are now a single code path. The previous separate wasm impl posted the body directly and skipped both the connection permit and RetryWrapper/adaptive-concurrency. wasm now flows through the same retry + permit logic; it differs only in the request body (raw Bytes, since reqwest's wasm backend can't stream request bodies — CONTENT_LENGTH is set by reqwest from the body) and emits a single bulk report_progress(n_transfer_bytes) after success instead of per-chunk streaming updates. Net improvement, but note the wasm upload path now newly exercises the retry/adaptive-concurrency machinery.
/// hits and `true` for dedup-cache hits, since cache entries come from
/// shards the server already has. Cache hits also bump the matched
/// shard to the most-recently-used position.
pub async fn chunk_hash_dedup_query(
claude generated — Correctness note: the (usize, FileDataSequenceEntry, bool) "already uploaded" contract matches native (shard_interface/native.rs::chunk_hash_dedup_query): session-shard hit → false, global-dedup-cache hit → true. wasm correctly omits native's third "resumed session" branch (no resume support on wasm) and mirrors only the two branches that apply. A miss here costs only a re-upload (never corruption), so the in-memory + bounded-LRU divergence from native's on-disk shard managers is low-risk for correctness.
Add job-level continue-on-error to build_and_test-wasm so a wasm failure no longer fails the overall CI run or gates merges/releases. The job still runs the full wasm suite (version-pin check, wasm builds, the cargo check -p hf-xet compile gate, and all 11 browser smokes) on every PR and main push; genuine regressions surface as a red "Build WASM" job (visible signal) while the run stays green. The per-step continue-on-error on the network smokes keeps the job green for hub/CAS blips, so a red job means a real wasm regression.
…arness
- dedupe TaskRuntime::bridge_async{,_finalizing} across targets via a
MaybeSend shim; only run_inner_async and bridge_sync* stay cfg-gated
- collapse the duplicated wasm variants of XetSessionBuilder::build and
TranslatorConfig::new into single fns with a cfg'd binding
- wasm shard dedup cache: drop LRU recency machinery for a bounded FIFO
- fold system_monitor into all_config_groups! now that its wasm gate is gone
- replace XetConfig::validate_usize_bounds with usize::try_from at the
ingestion_block_size cast sites
- unify wasm cfg predicate in async_read.rs (target_arch -> target_family)
- consolidate wasm/ci-smoke into a shared harness (run.mjs scenario table +
harness.html + common.mjs + scenarios/*.mjs), replacing 22 near-duplicate
runner/page files; ci.yml steps and blocking semantics unchanged
- drop wasm/check-version-pins.sh and its build-wasm action step
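The `MaybeSend` shim mentioned in the first bullet is a common pattern for sharing one function signature across native and wasm targets. The sketch below shows one way it could be written; `accepts_task` and `is_maybe_send` are hypothetical helpers, and the real trait in the PR may differ in detail.

```rust
use std::future::Future;

/// On native targets `MaybeSend` implies `Send`, so bounds written against
/// it keep the usual multi-threaded guarantees.
#[cfg(not(target_family = "wasm"))]
pub trait MaybeSend: Send {}
#[cfg(not(target_family = "wasm"))]
impl<T: Send> MaybeSend for T {}

/// On wasm the trait is unconstrained, so `!Send` futures (for example ones
/// holding JS handles) still satisfy the same bound.
#[cfg(target_family = "wasm")]
pub trait MaybeSend {}
#[cfg(target_family = "wasm")]
impl<T> MaybeSend for T {}

/// Hypothetical bridge entry point: a single signature for both targets
/// instead of two cfg-gated copies that differ only in `+ Send`.
pub fn accepts_task<F>(fut: F) -> F
where
    F: Future<Output = u32> + MaybeSend + 'static,
{
    fut
}

/// Compile-time probe used to check that a value satisfies the shim bound.
pub fn is_maybe_send<T: MaybeSend>(_value: &T) -> bool {
    true
}
```

The design choice is to move the cfg split into the trait definition once, so every function body that was identical across targets only has to be written once.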
Production CAS returns HMAC-keyed shards from query_for_global_dedup_shard: the chunk hashes in the shard are keyed with a per-shard key stored in the footer, which MDBInMemoryShard::from_reader skips. The wasm dedup cache was probing keyed lookup tables with raw chunk hashes, so global dedup silently never matched against prod (while working in unkeyed local simulation). Cache entries now carry the shard's MDBShardInfo (header + footer): probe hashes are keyed with chunk_hmac_key() per shard, mirroring the native shard_file_manager, and entries past shard_key_expiry are skipped (0 = no expiry, web_time for the wasm clock). Adds a native unit test building a keyed shard via export_as_keyed_shard_streaming and asserting raw probes miss while keyed probes hit — the only CI-visible guard, since simulation shards are unkeyed.
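The raw-versus-keyed probe mismatch can be illustrated with a toy model. Everything here is an assumption for illustration: `toy_keyed` is a simple mixing function and not an HMAC, hashes are reduced to `u64`, and `CachedShard` is a hypothetical type. Only the ideas of keying the probe per shard and skipping expired entries (with 0 meaning no expiry) come from the commit message.

```rust
use std::collections::HashSet;

/// Toy stand-in for per-shard keying. This is NOT a real HMAC; it only
/// models "the stored value depends on both the hash and the shard key".
fn toy_keyed(hash: u64, key: u64) -> u64 {
    hash.rotate_left(17) ^ key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Hypothetical cache entry carrying the shard's key and key expiry.
struct CachedShard {
    key: u64,
    key_expiry: u64, // 0 means "no expiry"
    keyed_chunks: HashSet<u64>,
}

impl CachedShard {
    /// Build an entry the way a keyed shard would arrive: already keyed.
    fn new(key: u64, key_expiry: u64, raw_hashes: &[u64]) -> Self {
        let keyed_chunks = raw_hashes.iter().map(|h| toy_keyed(*h, key)).collect();
        Self { key, key_expiry, keyed_chunks }
    }

    /// Key the probe with this shard's key before looking it up, and skip
    /// the entry entirely once its key has expired.
    fn contains(&self, raw_hash: u64, now: u64) -> bool {
        if self.key_expiry != 0 && now > self.key_expiry {
            return false;
        }
        self.keyed_chunks.contains(&toy_keyed(raw_hash, self.key))
    }
}
```

Probing the keyed table with the raw hash misses, which is the silent failure described above; keying the probe first restores the hit.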
… xorbs)
Downloads the pre-seeded deterministic 16 MiB file from xet-team/xet-wasm-test and re-uploads its bytes in a fresh commit, asserting the chunk-0 global-dedup query fully dedups the re-upload: new_bytes == 0, xorb_bytes_uploaded == 0, shard still pushed. This is the end-to-end regression guard for the HMAC-keyed shard lookup: if the keying regresses, deduped_bytes_by_global_dedup drops to 0 and a full payload of xorbs is pushed. Deterministic because prod CAS indexes the first chunk of every uploaded file and the client always queries chunk 0.
First run against an unseeded repo bootstraps: uploads the xorshift32 seed payload and commits it via the Hub NDJSON commit API so paths-info sees it and GC keeps its xorbs.
Verified against prod: bootstrap + assert runs pass; with the HMAC fix reverted the scenario fails with the expected signature. Non-blocking CI step like the other network smokes.
…as-casts
Scope cleanup of 65ab8ab: keep pre-existing main code untouched where the change had no wasm effect.
- xet_runtime config (groups/mod.rs, macros.rs, xet_config.rs): back to main's explicit per-consumer system_monitor appends with their cfg(not(wasm)) gates; the group is absent from the config on wasm as on main. Also drops the PR's validate_usize_bounds (net-zero vs main).
- data_client.rs, file_upload_session.rs: restore main's `as usize` casts of ingestion_block_size in the four cfg(not(wasm))-only fns (clean_file, hash_files_async, upload_files, feed_file_to_cleaner), since the try_from guard is meaningless on 64-bit-only code. The wasm-compiled site in file_cleaner.rs keeps the guard.
- api_changes doc updated to match.
…pload lifecycle, sha256 policy)
- download-range: prefix/mid/suffix byte-range downloads vs reference slices
- download-cancel: cancel() mid-stream, group must stay usable
- download-error: nonexistent hash + malformed fileInfo must reject, not hang
- upload-lifecycle: post-abort/post-commit misuse must reject (wasm mirror of native upload_commit state-machine tests)
- sha256-policy: compute/provided/skip metadata + parse_sha256_policy rejects
- download-multi: verify per-file content sha256 against pinned values
Also make smoke failures visible in CI: drop step-level continue-on-error (the job-level flag still keeps the overall run green and gates nothing), and absorb hub/CAS blips with a single in-runner retry in run.mjs instead.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 585d975.
    `dedup_metrics.xorb_bytes_uploaded=${dedup.xorb_bytes_uploaded}, expected < 1.5 * ${EXPECTED_PAYLOAD_SIZE} ` +
    `(only one payload's worth of bytes should hit CAS): ${JSON.stringify(dedup)}`,
  );
}
Dedup smoke misses zero xorb
Low Severity
The dedup smoke only upper-bounds commitReport.dedup_metrics.xorb_bytes_uploaded and never requires a positive upload count. When xorb_bytes_uploaded is zero (e.g. metrics taken before xorb upload tasks finish), the check still passes, so the scenario can miss the regression that the table's comments are meant to guard against.
seanses left a comment
first batch of comments
~/.cargo/bin/wasm-pack
~/.cargo/.crates.toml
~/.cargo/.crates2.json
key: ${{ runner.os }}-${{ runner.arch }}-cargo-tools-wbg-0.2.121-wp-0.14.0-rustc-${{ steps.rustc-version.outputs.version }}
"wasm-bindgen-cli" and "wasm-pack" are stable native host binaries that don't depend on nightly-only features, so embedding the rustc nightly version in the cache key is useless: compilation produces the same artifact as long as the binary release version is identical (e.g. 0.2.121).
- name: Check hf-xet compiles for wasm32-unknown-unknown
  shell: bash
  run: |
    CARGO_TARGET_WASM32_UNKNOWN_UNKNOWN_RUSTFLAGS="-C target-feature=+atomics,+bulk-memory,+mutable-globals --cfg getrandom_backend=\"wasm_js\"" \
Let's similarly put the build details under the "hf-xet" package (xet_pkg) "build_wasm.sh". And comparing this to "wasm/hf_xet_wasm/build_wasm.sh", a lot of RUSTFLAGS seem missing.
hf_xet/target
wasm/hf_xet_wasm/target
wasm/hf_xet_thin_wasm/target
wasm/hf_xet_wasm/target
Seems meaningless reorder but alright..
# Non-blocking: this job runs the full wasm test suite (version-pin check,
# wasm builds, the `cargo check -p hf-xet` compile gate, and all browser
# smoke tests) on every PR and main push, but a failure here must never
# fail the CI run or gate merges/releases. wasm is an additive target, so
# regressions surface as a red "Build WASM" job (visible signal) without
# turning the overall run red. Nothing `needs:` this job and the release
# workflows are independent, so it blocks no other jobs or releases either.
# (The per-step `continue-on-error` on the network smokes below keeps the
# job green for hub/CAS blips, so a red job means a real wasm regression.)
continue-on-error: true
This is contradictory to our expectation: if we ship WASM compatible hf-xet we don't want to break it in any version.
test -z "$(git status --porcelain Cargo.lock)" || (echo "hf_xet_wasm Cargo.lock has uncommitted changes!" && exit 1)
- uses: actions/setup-node@a0853c24544627f65ddf259abe73b1d18a591444 # v5.0.0
  with:
    node-version: '20'
Use "24". GitHub Actions is phasing out "20", so should we.
The previous separate `wasm/hf_xet_wasm_download` and `wasm/hf_xet_wasm_upload`
crates have been combined back into `wasm/hf_xet_wasm`.
Looks like this talks about some intermediate work in this PR, ask AI to trim such statements.
files from disk, and `hash_files_async` additionally routes through
`XetRuntime::spawn_blocking` (which on wasm runs `f` inline anyway —
unsuitable for the parallel-hash use case). Wasm consumers that need
hashing must drive it from JS (e.g. `crypto.subtle.digest`) or feed
What is crypto.subtle.digest? That's not the Xet file hash.
[lib]
- crate-type = ["cdylib", "rlib"]
+ crate-type = ["cdylib"]
What's the reasoning behind removing "rlib"?
#[cfg(not(target_family = "wasm"))]
mod file_paths;
#[cfg(target_family = "wasm")]
Instead of this, I'm thinking of completely cfg-gating the XetConfig values that are of type "TemplatedPathBuf", so we don't need to provide a dummy "TemplatedPathBuf" for wasm that is never used.
] }
tokio_with_wasm = { workspace = true, features = ["rt", "sync", "time"] }
web-time = { workspace = true }
wasm-bindgen = { workspace = true }
Does "xet_runtime" need "wasm-bindgen", "wasm-bindgen-futures" and "js-sys"? I thought they are used to generate JS bindings, i.e. in "wasm/*"
.enabled
.then(|| {
    SystemMonitor::follow_process(config.system_monitor.sample_interval, config.system_monitor.log_path.clone())
        .ok()
If we want to inspect the error, we can just add .inspect_err(|e| debug!(...)) before .ok(), instead of rewriting the function.
@@ -3,15 +3,18 @@ use std::collections::HashMap;
use std::fmt::Display;
use std::future::Future;
use std::panic::AssertUnwindSafe;
This file contains so much cfg-gated code that it's becoming difficult to read, and it still contains many code paths that won't be used by WASM at all (e.g. all member fields of XetRuntime). A XetRuntime for the WASM target can be extremely simple. How about splitting XetRuntime for WASM into a separate file and keeping this one only for the native target: #873


Closes #840
Summary
Makes
hf-xet(xet_pkg) and its streaming download + upload data-prep paths compile and run onwasm32-unknown-unknown, and reworkswasm/hf_xet_wasm/into a single unified example/smoke-test JS-binding crate (cdylib) that exposes both the download and upload flows from one wasm module. The earlier split into separatehf_xet_wasm_download/andhf_xet_wasm_upload/crates was reverted (seeff0c46b8) — one wasm module is enough for the CI smokes and easier for the example pages to share.The existing thin published crate
wasm/hf_xet_thin_wasm/is unchanged in scope but follows the new shared wasm-bindgen pinning.Two layers of changes
1. Compile compatibility (commits 12274c0 → 7e3c692): cfg-gated
tokiodep split,task_runtime.rsWASM bridge variants,xet_pkg::legacyand_blocking/path-based methods gated to non-wasm, cache gating inxet_client.2. Streaming runtime compatibility (commits 5b7c26e → 28b9265): make the entire streaming download path
!Send-tolerant so it actually executes on WASM (where reqwest's futures are!Send).A follow-up simplification pass (
65ab8aba) deduplicates the wasm/native split where the bodies were identical:TaskRuntime::bridge_async{,_finalizing}are written once via aMaybeSendshim,XetSessionBuilder::buildandTranslatorConfig::neware single fns with a cfg'd binding, the wasm shard dedup cache is a plain bounded FIFO, and the ci-smoke suite runs from one shared harness.Key changes (runtime compatibility refactor)
- `tokio_with_wasm` dependency: added to `[workspace.dependencies]` and to `xet_pkg` / `xet_data` / `xet_runtime` / `xet_client` wasm-target deps. On native, call sites use real `tokio`; on wasm, `use tokio_with_wasm::alias as tokio` redirects `spawn` / `JoinSet` / `select!` / `sync` to the wasm-bridged variants that use `wasm_bindgen_futures::spawn_local` internally, so spawned futures don't need `Send`.
- `?Send` `async_trait`: `URLProvider` (`xet_client`), `DataWriter` (`xet_data`), `XorbURLProvider`, `SequentialWriter`, `UnorderedWriter`, `UploadSessionDataManager` impls. Trait bounds on `dyn T` stay `Send + Sync`; only the futures lose `Send` on wasm.
- `tokio_with_wasm::alias as tokio` shims: every spawn site in the streaming path (`download_stream.rs`, `unordered_download_stream.rs`, `sequential_writer.rs`, `unordered_writer.rs`, `file_term.rs`, `manager.rs`, `xet_pkg/upload_*`, `xet_data/file_upload_session`, etc.) gets the wasm-only import.
- `SyncWriterThread`, `SequentialWriter::new` (disk), `DownloadStream::blocking_next`, `UnorderedDownloadStream::blocking_next`, the chunk-cache `tokio::spawn` in `xorb_block.rs`, every `_blocking` method on session types, `XetSession::sigint_abort`, `XetSessionBuilder::with_tokio_handle`, and the entire `xet_pkg::legacy` module are all `#[cfg(not(target_family = "wasm"))]`.
- `TaskRuntime::bridge_async` and `bridge_async_finalizing` (the task-state machine) exist once for both targets via a `MaybeSend` shim trait (`Send` supertrait on native, unconstrained on wasm); only `run_inner_async` (XetRuntime offload vs inline select) and the native-only `bridge_sync*` pair are cfg-gated.
- `xet_runtime` adjustments: un-gated `ClosureGuard` (pure Rust). Moved `tracing-appender` to non-wasm deps and gated the `init` submodule. `XetRuntime::new` has a wasm stub (no tokio runtime is created; the bridge variants in `task_runtime.rs` `.await` directly via `wasm_bindgen_futures`). `XetRuntime::spawn_blocking` cfg-gates to `tokio_with_wasm::task::spawn_blocking` on wasm. `XetRuntime::{handle, num_worker_threads, spawn, bridge_async, bridge_sync, external_run_async_task}` are gated to native: wasm callers route through the `task_runtime.rs` bridge variants instead, so misuse on wasm surfaces as a compile-time error rather than a runtime panic. Dropped the wasm `SystemMonitor` and gated the sysinfo-based native one off on wasm. The `system_monitor` config group keeps its pre-existing native-only gating (absent from the config on wasm).
- `FileReconstructor::{reconstruct_to_file, reconstruct_to_writer}`, `FileDownloadSession::{download_file, download_file_with_id, download_file_background, download_to_writer}`, `FileUploadSession::{upload_files, spawn_upload_from_path, feed_file_to_cleaner}`, `XetUploadCommit::upload_from_path`.
- `xet_data::processing` and `xet_data::file_reconstruction` un-gated. `FileDownloadSession`, `FileUploadSession`, `XetFileInfo`, `Sha256Policy`, `DownloadStream`, `UnorderedDownloadStream`, `ChunkCache`, `CacheConfig`, `create_remote_client` are now available on wasm. `TranslatorConfig::new` is a single fn that skips filesystem directory creation on wasm. New `shard_interface/wasm.rs` provides an in-memory `MDBInMemoryShard`-backed `SessionShardInterface` with a bounded FIFO dedup cache (no disk staging, no resume).
- `xet_pkg` cleanup: reverted the over-aggressive gates from earlier in this PR. `XetDownloadStreamGroup`, `XetDownloadStream`, `XetSession::new_download_stream_group`, `XetSession::new_upload_commit`, `XetUploadCommit::{upload_bytes, upload_stream, commit, abort}`, `XetStreamUpload`, `XetFileUpload`, `XetFileInfo`, `Sha256Policy`, and `FileReconstructionError` are all on wasm.
- `wasm/hf_xet_wasm/` (unified example crate, `cdylib`): wraps `xet::xet_session::XetSession` with `#[wasm_bindgen]` and exposes both the upload and download flows from one wasm module. The JS surface mirrors the Rust builder pattern: `XetSession` is auth-free; auth lives on the per-commit / per-group builder.
  - `new XetSession()` + `XetSession.newDownloadStreamGroup(endpoint, token, tokenExpiry) -> Promise<XetDownloadStreamGroup>` + `XetDownloadStreamGroup.downloadStream(fileInfo, byteRangeStart?, byteRangeEnd?) -> Promise<XetDownloadStream>` + `XetDownloadStream.next() -> Promise<Uint8Array | undefined>` + `cancel()`.
  - `XetSession.newUploadCommit(endpoint, token, tokenExpiry) -> Promise<XetUploadCommit>` + `XetUploadCommit.{uploadBytes, uploadStream, commit, abort}` + `XetStreamUpload.{write, finish, abort}`.
  - `build_wasm.sh` + `examples/download.html` + `examples/upload.html` browser test pages.
  - This is not a published browser SDK: real consumers should depend on `hf-xet` directly with their own `#[wasm_bindgen]` glue, or use a downstream SDK such as `hf-hub`.
- `.github/actions/build-wasm/action.yml`, `cargo +nightly check --target wasm32-unknown-unknown -p hf-xet` compile gate, Cargo.lock freshness checks, plus a 12-scenario headless-Chromium Playwright smoke matrix in `wasm/ci-smoke/` (see Testing). The smokes run from a single shared harness: `node run.mjs <scenario>` drives one `harness.html` that dynamically imports the per-scenario module from `wasm/ci-smoke/scenarios/`, with shared token/assert helpers in `common.mjs`; each scenario file holds only its scenario-specific logic.

WASM API surface (downloads)
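As a rough illustration of the download surface, the sketch below drains one stream into a single buffer. The interfaces are structural stand-ins rather than the generated bindings: only the method names and the `Promise<Uint8Array | undefined>` return type of `next()` come from this description. The parameter types, the `void` return of `cancel()`, the reading of `undefined` as end-of-stream, and the helper names `collectStream` and `downloadFile` are assumptions made for the example.

```typescript
// Structural stand-ins for the classes exposed by the wasm module.
// Parameter types and the return type of cancel() are assumptions.
interface XetDownloadStream {
  next(): Promise<Uint8Array | undefined>;
  cancel(): void;
}

interface XetDownloadStreamGroup {
  downloadStream(
    fileInfo: unknown,
    byteRangeStart?: number,
    byteRangeEnd?: number,
  ): Promise<XetDownloadStream>;
}

// Hypothetical helper (not part of hf_xet_wasm): read chunks until next()
// resolves to undefined, then concatenate them into one Uint8Array.
async function collectStream(stream: XetDownloadStream): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  try {
    for (;;) {
      const chunk = await stream.next();
      if (chunk === undefined) break; // assumed to signal end of stream
      chunks.push(chunk);
      total += chunk.length;
    }
  } catch (err) {
    stream.cancel(); // stop the stream if a read fails, then rethrow
    throw err;
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

// Hypothetical helper: download a whole file through an existing group.
async function downloadFile(
  group: XetDownloadStreamGroup,
  fileInfo: unknown,
): Promise<Uint8Array> {
  const stream = await group.downloadStream(fileInfo);
  return collectStream(stream);
}
```

Buffering the whole file is only reasonable for small downloads; a real page would more likely hand each chunk to a writer as it arrives.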
WASM API surface (uploads)
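A similarly hedged sketch for the streaming upload handle: it writes chunks in order, finishes the stream, and aborts on failure. Only the method names `write`, `finish`, and `abort` are taken from this description; their argument and return types are guesses, and `pumpUpload` is a hypothetical helper. How the handle is obtained from `uploadStream` is left out because its arguments are not listed here.

```typescript
// Assumed shape of a stream upload handle. The argument and return types
// are illustrative guesses, not the generated bindings.
interface XetStreamUpload {
  write(chunk: Uint8Array): Promise<void>;
  finish(): Promise<unknown>;
  abort(): void;
}

// Hypothetical helper (not part of hf_xet_wasm): write every chunk in order,
// then finish. If a write or finish fails, abort the upload and rethrow.
async function pumpUpload(
  upload: XetStreamUpload,
  chunks: Uint8Array[],
): Promise<unknown> {
  try {
    for (const chunk of chunks) {
      await upload.write(chunk);
    }
    return await upload.finish();
  } catch (err) {
    upload.abort();
    throw err;
  }
}
```

Awaiting each `write` before issuing the next keeps the chunks ordered and lets the caller apply backpressure instead of queueing the whole file in memory.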
The unified `hf_xet_wasm` crate exposes the JS classes `XetSession`, `XetDownloadStreamGroup`, `XetDownloadStream`, `XetUploadCommit`, `XetStreamUpload`, and `XetFileUpload` from a single wasm module.

Testing
- `hf-xet` (95 unit + 6 doctest) and `xet-data` (283 unit) tests pass
- `cargo +nightly check --target wasm32-unknown-unknown -p hf-xet` passes (CI)
- `wasm/hf_xet_wasm/build_wasm.sh` and `wasm/hf_xet_thin_wasm/build_wasm.sh` produce `pkg/{js,d.ts,wasm}` (CI)
- Headless-Chromium Playwright smoke scenarios (`wasm/ci-smoke/`, each invoked as `node run.mjs <scenario>`):
  - `invalid-inputs`: local-only wasm-side input validation (blocking, no Hub or CAS).
  - `download`: single-file download from prod hub + CAS, asserts byte count + SHA-256 (non-blocking, hub blips don't fail PRs).
  - `download-multi`: concurrent multi-file download in one `XetDownloadStreamGroup` (non-blocking).
  - `upload`: `uploadBytes` + `commit()` (xorb + shard push to CAS); regression guard for the wasm `XetRuntime::spawn_blocking` panic (non-blocking).
  - `upload-stream`: streaming variant of `upload` (`uploadStream` + `XetStreamUpload::{write, finish}`) (non-blocking).
  - `upload-multi`: concurrent multi-file `uploadBytes` in one commit (non-blocking).
  - `upload-stream-multi`: concurrent multi-file `uploadStream` (non-blocking).
  - `upload-mixed`: `uploadBytes` + `uploadStream` concurrently in one commit (non-blocking).
  - `upload-tiny`: 0-byte / 1-byte / 64 KiB files in one commit; catches regressions to the empty-xorb / no-chunks path (non-blocking).
  - `upload-multi-commit`: two sequential `XetUploadCommit`s from one `XetSession`; catches XetSession-level resource leaks (non-blocking).
  - `dedup`: session-level in-commit dedup; two identical 65 MiB files trigger a cross-xorb dedup hit (non-blocking).
  - `global-dedup`: downloads the pre-seeded deterministic 16 MiB file from the test repo and re-uploads it; asserts the HMAC-keyed global-dedup shard lookup fully dedups the re-upload (`xorb_bytes_uploaded === 0`, no new xorb) (non-blocking).
- Serve `wasm/hf_xet_wasm/examples/download.html` or `wasm/hf_xet_wasm/examples/upload.html` with a static server (the smoke `server.mjs` sets the required COOP/COEP headers).

Note
High Risk
Touches core async/I/O paths for a new target and runs prod Hub/CAS smokes; regressions affect browser consumers, but the wasm job does not block merges (`continue-on-error`).

Overview
Adds `wasm32-unknown-unknown` support for the core `hf-xet` stack via workspace deps (`tokio_with_wasm`, pinned `wasm-bindgen` 0.2.121), lockfile updates, and documented wasm-only API/runtime patterns in README and `api_changes/update_260515_wasm_target_support.md`.

CI / build: `.github/actions/build-wasm` now caches wasm tool binaries (keyed on nightly `rustc`), bumps wasm-bindgen-cli / wasm-pack, runs `cargo +nightly check -p hf-xet` for wasm, and reorders rust cache paths. The Build WASM job is `continue-on-error: true` but runs a large Playwright matrix under `wasm/ci-smoke/` (shared harness, many upload/download/dedup/lifecycle scenarios against prod Hub/CAS when tokens are set).

Example crates: `wasm/hf_xet_wasm` becomes a thin `cdylib` JS wrapper around `hf-xet` (not a published SDK); `hf_xet_thin_wasm` aligns bindgen versions and `cdylib`-only output. `.gitignore` drops checked-in VS Code settings and ignores editor/`node_modules` trees.

Reviewed by Cursor Bugbot for commit 585d975. Bugbot is set up for automated code reviews on this repo.