
WGPU/Metal backend, pure-env config loader, nextest migration (MULTI-1407) #5

RobbieMcKinstry merged 3 commits into trunk from robbie/multi-1407 on Jun 29, 2026

@RobbieMcKinstry

Summary

Three stacked commits on trunk:

  • 043116d — Add WGPU/Metal backend with backend-aware precision (MULTI-1407). New wgpu crate feature; selection order is cuda > wgpu > ndarray. Precision is now backend-aware: bf16 on CUDA (the locked cloud recipe — MULTI-1386), f32 on WGPU (Metal has no bf16 arithmetic over WGPU) and ndarray. Crate-root #![recursion_limit = "256"] bump for CubeCL's #[cube] macros. Forward/backward smoke test on the active backend, verified locally on Metal with cargo test --no-default-features --features wgpu.
  • a6240ce — Make the config loader pure of std::env; assert no-flake testing policy. Two pre-existing loader tests flaked under default-parallel cargo test because figment::Jail::set_env mutates process-global env via std::env::set_var while concurrent tests (without a Jail) read WUBBIE_MODEL_* through figment::Env::prefixed. LayeredConfig::with_env no longer reads process env; read_model_env_overrides() captures std::env::vars() once at the CLI boundary into an EnvOverrides map that is threaded through load_model_config / TrainSubcommand::resolve_model_config. Tests build the map directly and never mutate process env. CLAUDE.md gains a Testing policy section codifying the no-flake rule and the read-ambient-state-at-the-CLI-seam pattern.
  • 57be649 — Migrate test runner to nextest; add bacon-driven monitor task. CI swaps cargo test for cargo nextest run --workspace --locked --no-tests=pass; Makefile.toml gets matching test and monitor tasks (the latter driven by bacon --headless).

Acceptance criteria — MULTI-1407

  • Backend selectable (CUDA / WGPU-Metal / ndarray) without model-code changes — ✅
  • Builds cleanly with the wgpu feature (#![recursion_limit = "256"] applied) — ✅
  • Precision is backend-appropriate: f32 under WGPU/Metal, bf16 path preserved for CUDA — ✅
  • Forward/backward step under WGPU/Metal — ✅ (smoke test runs on Metal locally)
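The selection order (`cuda > wgpu > ndarray`) and the backend-aware precision rule can be captured as a small pure function. This is a hypothetical model, not the crate's code: `BackendKind`, `Precision`, `select_backend`, and `active_backend` are illustrative names, and the real crate presumably binds concrete backend types behind the `cuda` and `wgpu` features rather than returning an enum.

```rust
/// Hypothetical model of the compile-time backend choice; the real crate
/// presumably maps each variant to a concrete backend type instead.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendKind {
    Cuda,
    Wgpu,
    NdArray,
}

/// Floating-point precision used for training arithmetic (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Bf16,
    F32,
}

/// Pure selection rule: cuda > wgpu > ndarray.
pub fn select_backend(cuda_enabled: bool, wgpu_enabled: bool) -> BackendKind {
    if cuda_enabled {
        BackendKind::Cuda
    } else if wgpu_enabled {
        BackendKind::Wgpu
    } else {
        BackendKind::NdArray
    }
}

/// Applies the rule to the crate features this binary was compiled with.
pub fn active_backend() -> BackendKind {
    select_backend(cfg!(feature = "cuda"), cfg!(feature = "wgpu"))
}

impl BackendKind {
    /// bf16 only on CUDA; Metal has no bf16 arithmetic over WGPU, and ndarray stays f32.
    pub fn precision(self) -> Precision {
        match self {
            BackendKind::Cuda => Precision::Bf16,
            BackendKind::Wgpu | BackendKind::NdArray => Precision::F32,
        }
    }
}
```

Keeping the rule a function of plain booleans lets it be unit-tested without building the CUDA or WGPU backends; `active_backend` only feeds it the `cfg!` values of whichever features were enabled at compile time.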

Test plan

  • cargo fmt --all --check
  • cargo clippy --all-targets --workspace --locked -- -D warnings (default ndarray, --features wgpu, --features cuda)
  • cargo build --workspace --locked (default) and cargo build -p wubbie --no-default-features --features {cuda,wgpu} --locked
  • cargo nextest run --workspace --locked --no-tests=pass — 55/55 deterministic across 5 consecutive runs
  • WUBBIE_MODEL_D_FF=9999 cargo nextest run --workspace --locked --no-tests=pass — still 55/55 (confirms the loader is now pure of process env)
  • cargo test --no-default-features --features wgpu --lib backend:: runs the forward/backward smoke test on Metal via WGPU
  • CI ⚡ PR Ready green (push, merge-queue contexts)

🤖 Generated with Claude Code

RobbieMcKinstry and others added 3 commits June 29, 2026 14:37
WGPU is wired alongside CUDA behind a new `wgpu` feature, dispatching to
Metal on macOS / Vulkan on Linux / DX12 on Windows. Precision is now
backend-aware: bf16 on CUDA (the locked cloud recipe, MULTI-1386), f32 on
WGPU (Metal doesn't implement bf16 arithmetic over WGPU) and ndarray.
Selection order is `cuda > wgpu > ndarray`; the model and training code
stay backend-generic. Bumps the crate-root recursion limit to 256 for
CubeCL's `#[cube]` macro expansion.

Verified locally with `cargo test --no-default-features --features wgpu`:
the forward/backward smoke test runs on Metal via WGPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…licy

`LayeredConfig::with_env` no longer reads process env. A new
`read_model_env_overrides()` captures `std::env::vars()` once at the CLI
boundary into an `EnvOverrides` map, which is threaded explicitly through
`load_model_config` and `TrainSubcommand::resolve_model_config`. Tests
build the map directly (`parse_env_overrides`) and never call `set_var`
or `Jail::set_env`.

Before this change, two loader tests flaked under parallel `cargo test`.
`figment::Jail::set_env` mutates process-global env via
`std::env::set_var`. `Jail` serializes its users on a global mutex, but
tests that don't use a `Jail` skip that lock and still read
`WUBBIE_MODEL_*` through figment's `Env::prefixed` provider, so the
env-setting test could inject `D_FF=2222` into a concurrent test's
"defaults only" extraction. Verified deterministic with
`WUBBIE_MODEL_D_FF=9999 cargo test` and 5 consecutive default-parallel
runs (55/55 pass).

CLAUDE.md gains a "Testing policy" section asserting no flakes are
tolerated and codifying the read-ambient-state-at-the-CLI-seam pattern so
the same class of race can't be reintroduced through a different surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
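A minimal std-only sketch of the split the commit above describes: exactly one function reads process env, and everything downstream takes the override map explicitly. `read_model_env_overrides`, `parse_env_overrides`, `EnvOverrides`, and `load_model_config` are names from the commit, but the signatures here are assumptions; `ModelConfig`, its fields and default values, and the lowercased-suffix key format are hypothetical, and the real loader layers these overrides through figment rather than by hand.

```rust
use std::collections::HashMap;

/// Hypothetical shape: `WUBBIE_MODEL_` prefix stripped, remaining key lowercased.
pub type EnvOverrides = HashMap<String, String>;

const PREFIX: &str = "WUBBIE_MODEL_";

/// Pure: builds overrides from any (key, value) iterator. Tests call this directly.
pub fn parse_env_overrides<I>(vars: I) -> EnvOverrides
where
    I: IntoIterator<Item = (String, String)>,
{
    vars.into_iter()
        .filter_map(|(key, value)| {
            key.strip_prefix(PREFIX)
                .map(|field| (field.to_ascii_lowercase(), value))
        })
        .collect()
}

/// The only function that touches process env; call it once at the CLI boundary.
pub fn read_model_env_overrides() -> EnvOverrides {
    parse_env_overrides(std::env::vars())
}

/// Hypothetical subset of the model config; the defaults are placeholders.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ModelConfig {
    pub d_model: usize,
    pub d_ff: usize,
}

impl Default for ModelConfig {
    fn default() -> Self {
        ModelConfig { d_model: 512, d_ff: 2048 }
    }
}

/// Layers explicit overrides over defaults without ever reading std::env.
pub fn load_model_config(overrides: &EnvOverrides) -> Result<ModelConfig, String> {
    let mut config = ModelConfig::default();
    for (field, raw) in overrides {
        let slot = match field.as_str() {
            "d_model" => &mut config.d_model,
            "d_ff" => &mut config.d_ff,
            // Unknown keys are ignored in this sketch; the real loader may differ.
            _ => continue,
        };
        *slot = raw
            .parse::<usize>()
            .map_err(|e| format!("WUBBIE_MODEL_{}={raw:?}: {e}", field.to_ascii_uppercase()))?;
    }
    Ok(config)
}
```

At the CLI boundary this becomes `let overrides = read_model_env_overrides();` followed by `load_model_config(&overrides)` (or passing `&overrides` on to `TrainSubcommand::resolve_model_config`). Tests call `parse_env_overrides` on a literal `Vec` of pairs instead, so no test needs `set_var`, and a stray `WUBBIE_MODEL_D_FF` in the shell cannot leak into a defaults-only assertion.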
CI (`on-push.yml` / `on-merge.yml`) installs `cargo-nextest` via
`taiki-e/install-action@v2` and runs `cargo nextest run --workspace
--locked --no-tests=pass` in place of `cargo test`. `Makefile.toml`'s
`[tasks.test]` is swapped to the matching nextest invocation, and
CLAUDE.md's Build Commands / Testing policy / CI Guardrails sections are
updated to reference `cargo nextest run` so the documented commands match
what CI actually runs.

A new `[tasks.monitor]` is added to `Makefile.toml` so the Claude Code
harness's "run `cargo make monitor` on launch" hook succeeds instead of
failing with `Task "monitor" not found`. It runs `bacon --headless
--summary --no-help-line`. Because wubbie is a virtual workspace, bacon
configuration lives under `[workspace.metadata.bacon]` (rather than
`[package.metadata.bacon]` as in keystore) with `check` (default),
`clippy`, and `test` jobs mirroring the CI gate; the `test` job uses
`analyzer = "nextest"` and `--hide-progress-bar --failure-output final`
so bacon can parse nextest output structurally.

Verified locally with `cargo make test`: 55/55 pass under nextest.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
RobbieMcKinstry merged commit 2591598 into trunk on Jun 29, 2026
3 checks passed
RobbieMcKinstry deleted the robbie/multi-1407 branch on June 29, 2026 at 18:52