[ci] add PPU unittest workflow #522
Conversation
Adds .github/workflows/unittest_ppu_ci.yml — a workflow_dispatch-only lane that runs `bash scripts/ci/ci_test.sh` inside the `mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-test:1.1-ppu` image on the self-hosted `tzrec-ppu-runner`. Uses the same `docker run` shape as master's unittest_h20_ci.yml (avoids the DSW authZ block on the `container:` form's docker.sock injection). No `pull_request` trigger and no PR-number `concurrency:` group — PPU runner capacity is opt-in and manually dispatched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PPU test image (tzrec-test:1.1-ppu) ships requirements pre-installed, so the ppu lane skips the pip-install preface that ci_test.sh runs and goes straight to gen_proto + ci_data + tzrec/tests/run.py. Keeps ci_test.sh untouched so the existing GPU lane is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the pull_request trigger (opened/reopened/synchronize) and a PR-number concurrency group, mirroring the other tzrec CI lanes. Revert this commit (or drop the two top blocks) once the PPU runner is stable and we want manual-only dispatch again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPUs are Alibaba's alixpu accelerators, not NVIDIA GPUs — expose them to the container via explicit --device for /dev/alixpu_ppu0, /dev/alixpu_ppu1, /dev/alixpu, and /dev/alixpu_ctl rather than the nvidia-container-runtime --gpus hook (which the PPU host doesn't have). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPU test runs exceed the default 6h job timeout (run 26203745164 hit the cap). Bump timeout-minutes to 1440 until the lane is fast enough to fit under the default again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPU CI run 26218011590 hit `ModuleNotFoundError: No module named
'hstu_attn_2_cuda'` because the PPU image lacks that CUDA-only wheel,
while the existing `gpu_unavailable` skip is False on this lane (the
PPU image still reports torch.cuda.is_available()). Introduce a
`cutlass_hstu_unavailable` probe alongside `gpu_unavailable` in
tzrec/utils/test_util.py that flips when the wheel can't be imported,
and stack it on the four cutlass-only tests:
- tzrec/ops/hstu_attention_test.py: test_attn_cutlass,
test_sla_attn_cutlass
- tzrec/tests/rank_integration_test.py:
test_rank_dlrm_hstu_cutlass_train_eval_export,
test_rank_ultra_hstu_cutlass_train_eval_export
This is an availability gate, not a PPU-named skip — it fires on any
host missing the wheel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPU CI run 26218011590 reproduced a deterministic ref_dw vs opt_dw mismatch (abs 0.127 on 49.8% of D=257 elements) in test_ln at dtype=bfloat16, norm_type=rms_norm. Bisecting on ty_ppu showed that restricting the `_get_bwd_dwdb_configs` autotune to num_warps in [8, 16] (dropping the num_warps=32 config) makes the test pass deterministically in 12.6s. Master's kernel is byte-identical and works on NVIDIA H20, so this is an alixpu Triton runtime bug at num_warps=32 for this reduction shape; we work around by adding a small `is_ppu_arch()` detector in tzrec/ops/__init__.py and using it to gate the autotune sweep, mirroring how `torch.ops.hip` already restricts the AMD lane. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPU CI run 26218011590 hit numerical drift in test_addmm (fp32 abs 1.22e-4, rel 5e-6 vs default rtol 1.3e-6) and test_jagged_dense_bmm_broadcast_add_triton (fp32 abs 2.86e-5 on 1.4% of elements). Local repro on ty_ppu confirms these are legitimate ULP-level differences from alixpu Triton's matmul reduction order vs PyTorch eager -- not a kernel bug. Add a `get_compare_tolerance(dtype)` helper to tzrec/utils/test_util.py that returns wider (atol, rtol) on PPU and PyTorch defaults elsewhere, and default the helper into _test_addmm and _test_jagged_dense_bmm_broadcast_add when the caller passes atol=rtol=None. NVIDIA's tight regression guard is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPU CI run 26218011590 reported `test_dlrm_hstu_rtp_train_export` and
`test_multi_tower_din_rtp_train_export` failing with `assertTrue(False)`.
Local repro on ty_ppu surfaces two distinct missing-dependency root
causes:
1. `test_dlrm_hstu_rtp_train_export` hits
`ModuleNotFoundError: No module named 'torch_fx_tool'` inside the
export subprocess (export_util.py:671). The PPU image doesn't ship
the torch_fx_tool wheel that RTP export depends on.
2. `test_multi_tower_din_rtp_train_export` hits
`NameError: name 'DynamicEmbScoreStrategy' is not defined` inside
train_eval (dynamicemb_util.py:151) because the test config sets
`dynamicemb { }` on features but the PPU image has no dynamicemb.
The NameError is a latent bug -- `build_dynamicemb_constraints`
references symbols only imported inside the `try: import dynamicemb`
block; on hosts without dynamicemb the function fails with a NameError
instead of a clear message.
Gate both tests on the relevant dependency (mirroring the cutlass
Family-A pattern), and harden `build_dynamicemb_constraints` to raise a
clear RuntimeError when `has_dynamicemb` is False.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPU fp32 Triton-vs-PyTorch tolerance was conservatively (1e-4, 1e-5). Tightening to (3e-5, 2e-5) gives ~2.1x headroom on test_jagged_dense_bmm (worst observed abs 2.86e-5, |y|~1.53) and ~4.4x on test_addmm (worst abs 1.22e-4, |y|~25.5), verified on ty_ppu. Drop the bf16/fp16 branches that were invented without measurement -- those dtypes fall back to PyTorch defaults. Also trim verbose docstrings/comments per [[feedback_concise_inline_comments]]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gger Replace the temporary universal pull_request trigger from edaba50 with a permanent branches: ['release/**'] filter. PRs targeting any release branch auto-trigger the PPU lane (where stability matters); PRs to master still need workflow_dispatch (where PPU runner capacity is opt-in). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| def get_compare_tolerance( | ||
| dtype: torch.dtype, | ||
| ) -> Tuple[Optional[float], Optional[float]]: | ||
| """Return (atol, rtol) for Triton-vs-PyTorch comparisons; widen fp32 on PPU.""" | ||
| from tzrec.ops import is_ppu_arch | ||
|
|
||
| if is_ppu_arch() and dtype == torch.float32: | ||
| return (3e-5, 2e-5) | ||
| return (None, None) |
There was a problem hiding this comment.
The (None, None) return path makes the contract awkward — every caller has to special-case it, and the docstring doesn't mention it. Two concrete issues this invites:
- Partial-override blind spot at call sites. Both callsites added in this PR use
if atol is None and rtol is None:(seejagged_tensors_test.py:389,mm_test.py:79). A future caller that passesatol=1e-6, rtol=Nonesilently skips the widening andrtolfalls back to torch's dtype defaults — which on PPU could mask or surface false regressions depending on direction. Widening each independently would be safer:ppu_atol, ppu_rtol = get_compare_tolerance(dtype) atol = atol if atol is not None else ppu_atol rtol = rtol if rtol is not None else ppu_rtol
- Returning
(None, None)to mean "use defaults" is unusual. Consider returning concrete defaults for non-PPU/non-fp32 (e.g., the dtype's standardtorch.testing.assert_closedefaults), or restructure asget_compare_tolerance(dtype, atol, rtol)so the helper owns the precedence rule and callsites stay one line.
At minimum, the docstring should call out the (None, None) return path.
| def is_ppu_arch() -> bool: | ||
| """Return True if a CUDA device is an Alibaba PPU (alixpu) accelerator.""" | ||
| global _is_ppu_arch_cached | ||
| if _is_ppu_arch_cached is None: | ||
| try: | ||
| _is_ppu_arch_cached = torch.cuda.is_available() and any( | ||
| "PPU" in torch.cuda.get_device_name(i) | ||
| for i in range(torch.cuda.device_count()) | ||
| ) | ||
| except Exception: | ||
| _is_ppu_arch_cached = False | ||
| return _is_ppu_arch_cached |
There was a problem hiding this comment.
Two notes on the implementation vs. the docstring:
- Scope is broader than the docstring suggests. The function returns True if any visible CUDA device's name contains the substring
"PPU", and the result is cached for the process lifetime. The docstring reads as if it tests "a CUDA device" — worth tightening to e.g. "Return True if any visible CUDA device name contains 'PPU'. Cached after first call." This also surfaces the mixed-fleet edge case (PPU + non-PPU on one host both report True). except Exceptionis wider than needed. The only realistic failure modes areRuntimeErrorfrom a broken driver andAssertionErrorfrom a stale device index. The broad clause will silently swallow future typos / API changes and quietly disable PPU code paths. Narrow to the specific exceptions, or at least leave a one-line comment naming the failure mode being defended against — otherwise the next reader has to guess.
| concurrency: | ||
| group: unittest-ppu-ci-${{ github.event.pull_request.number }} | ||
| cancel-in-progress: true |
There was a problem hiding this comment.
github.event.pull_request.number is empty for workflow_dispatch, so all manual invocations collapse into the group key unittest-ppu-ci- and cancel-in-progress: true will cancel each other. Suggest:
| concurrency: | |
| group: unittest-ppu-ci-${{ github.event.pull_request.number }} | |
| cancel-in-progress: true | |
| concurrency: | |
| group: unittest-ppu-ci-${{ github.event.pull_request.number || github.run_id }} | |
| cancel-in-progress: true |
| - name: FetchCommit | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| path: run_${{ github.run_id }} |
There was a problem hiding this comment.
Minor drift from unittest_h20_ci.yml: the H20 lane pins ref: ${{ github.event.pull_request.head.sha }}, this one doesn't. With pull_request, the default checkout is the merge ref — slightly different semantics (catches conflicts, but won't pin if the PR is rebased mid-run). If the intent was parity with H20, add the ref: line; if intentional, worth a one-line comment noting why.
Review summaryFocused, well-scoped PR: a new release-branch-only PPU CI lane plus minimal source changes (capability gating + tolerance widening) to make the existing test suite green on alixpu. The PR description's failure-mode → root-cause walk is unusually clear. Posture comments below; specific issues are inline. Test gating approach is right. Using
Security posture (informational, not a blocker). The lane runs on a self-hosted runner on Inline comments cover:
|
…=4x) PPU CI run 26269922710 had 28 of 31 errors caused by hardcoded subprocess timeouts (train_eval 600s, export 1800s) being too tight for PPU's slower kernels -- alixpu is intrinsically 3-5x slower than NVIDIA H20 on these paths once Triton autotune is cold. Wrap every `timeout=600` / `timeout=1800` in `tzrec/tests/utils.py` with a `_t(...)` helper that scales by `$TZREC_TEST_TIMEOUT_SCALE` (default 1.0 -- NVIDIA lanes unchanged) and set the scale to 4 in the PPU workflow's docker env so train_eval gets 2400s and export gets 7200s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ness) Capping num_warps<=16 in 26313f4 only partially fixed the bf16 D=257 rms_norm bwd mismatch on PPU -- run 26269922710 still showed FlakyFailure with abs diff 0.114 (was 0.196 before the cap). Root cause is deeper: the atomic_cas/atomic_xchg memory ordering in `_weighted_rms_norm_bwd_dx` interacts poorly with PPU's Triton runtime even when GROUP_N==N (each lock_id is unique and the lock is functionally a no-op). Add a `SKIP_LOCK: tl.constexpr` parameter and take a simple per-row `tl.store` path when the caller knows GROUP_N==N (which it does for any N <= the default 2048). Verified 5/5 stable runs of test_ln on ty_ppu. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…U torch PPU CI run 26323330816 had the 3 remaining AOT integration tests (test_multi_tower_din_with_fg_train_eval_aot_export_input_tile, test_rank_dlrm_hstu_train_eval_export, *_unified_aot) failing during predict() with: AttributeError: module 'torch._inductor' has no attribute 'codecache' at torch/export/pt2_archive/_package.py:1019 in _load_aoti, which does `torch._inductor.codecache.get_device_information(device)` and assumes the submodule was already populated on `torch._inductor`. The PPU CI image ships torch 2.10.0+ppu (the ty_ppu DSW host has 2.9.0+ppu); the 2.10 build does not auto-import `torch._inductor.codecache` when you just `import torch._inductor`, so the attribute access raises. Add an explicit `import torch._inductor.codecache` at the top of `tzrec/acc/aot_utils.py`, which is in the import chain of every AOTI load call site. Verified inside the tzrec-test:1.1-ppu container: all 3 previously-failing tests pass (584s, 354s, 5298s respectively). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After 025e240 added SKIP_LOCK to _weighted_rms_norm_bwd_dx, the PPU num_warps cap from 26313f4 is no longer needed: in-container 5/5 stable runs of test_ln with num_warps=32 re-enabled. The actual root cause was the atomic_cas/atomic_xchg pattern in bwd_dx; num_warps had only incidentally narrowed the failure window. Restore the upstream config sweep so PPU autotune retains the num_warps=32 option (NVIDIA-equivalent search space). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SKIP_LOCK path in _weighted_rms_norm_bwd_dx only fixes GROUP_N==N. With large batches (N > GROUP_N default 2048) PPU's atomic_cas/atomic_xchg memory-ordering issue still produces flaky wrong dw -- reproduced inside the CI image with N=3000, D=257, bf16 (max_abs 0.5625 on 1 of 3 trials). Force Kernel.TRITON to fall back to PyTorch eager at the rms_norm dispatcher on PPU. Verified 5/5 stable test_ln in-container. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5497626 routes rms_norm Triton -> PyTorch on PPU at the dispatcher, so the kernel is no longer reached on PPU. SKIP_LOCK was a defense-in-depth addition that's now redundant; revert it to keep the kernel minimal and identical to upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_cmd already dumps the last 500 lines of log_file when the subprocess exits non-zero (e.g. NCCL watchdog crash), so CI logs contain the real stack trace. Extend the same treatment to the TimeoutExpired path so that hangs which exceed the parent's wall-clock budget (e.g. SCALE=4 train_eval at 2400s) also surface the subprocess's last activity before the kill, instead of just re-raising opaquely from misc_util.run_cmd. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a PPU unit-test lane (
.github/workflows/unittest_ppu_ci.yml) forrelease/v1.1, targeting the self-hostedtzrec-ppu-runnerwith imagemybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-test:1.1-ppu. Shaped after master'sunittest_h20_ci.yml(inlinedocker runinstead ofcontainer:to dodge the DSW authZ docker.sock injection), with PPU-specific adjustments: alixpu/devdevice passthrough instead of--gpus all,--shm-size=256g, noCUDA_HOMEenv, 24h job timeout, and a dedicatedscripts/ci/ci_test_ppu.shthat skipspip install(the image ships deps pre-installed).Trigger policy: auto-fires on PRs targeting
release/**(where stability matters) plusworkflow_dispatch. PRs to master do not auto-trigger — PPU runner capacity stays opt-in for development branches.Making the lane green
Initial run (26218011590) had 8 failures + 5 errors; follow-up runs surfaced 28 subprocess timeouts and an AOTI-load failure. Fixed by root cause:
hstu_attn_2_cuda(cutlass HSTU),torch_fx_tool(RTP export),dynamicemb. Gate the affected tests on@unittest.skipUnless(<dep>_available, ...)so they skip cleanly on any host missing the wheel (not a PPU-named carve-out).atomic_cas/atomic_xchgmemory-ordering bug in_weighted_rms_norm_bwd_dxproduced wrong dw for bf16 rms_norm backward (deterministic ~0.196 abs-diff at D=257 in CI; flaky ~0.5 abs-diff at N=3000 with contention via direct kernel probe). Routerms_norm's Kernel.TRITON to PyTorch eager on PPU at the dispatcher intzrec/ops/layer_norm.py; the Triton kernel itself is left untouched.get_compare_tolerance(dtype)that returns(atol=3e-5, rtol=2e-5)for fp32 on PPU (~2.1x / ~4.4x headroom on observed worst) and PyTorch defaults elsewhere, so NVIDIA's tight regression guard stays intact.dynamicemb_util.build_dynamicemb_constraints— referencedDynamicEmbScoreStrategyoutside thetry: import dynamicembguard. Made it fail loudly with a clear RuntimeError whenhas_dynamicembis False.train_evalat 600s andexportat 1800s were over-running on PPU. Wrap every hardcoded timeout intzrec/tests/utils.pywith a_t(...)helper that scales by$TZREC_TEST_TIMEOUT_SCALE(default 1.0 — NVIDIA lanes unchanged); set the scale to 4 in the PPU workflow's docker env (export budget becomes 7200s; observed unified_aot needs ~5300s in-container).torch._inductor.codecachenot auto-imported on PPU torch 2.10.0+ppu —import torch._inductordoes NOT populatetorch._inductor.codecache, and PyTorch's own_load_aoti(pt2_archive/_package.py:1019) accesses it as if pre-imported, raisingAttributeErrorin the predict subprocess of every AOT integration test. Fix is one explicitimport torch._inductor.codecacheat the top oftzrec/acc/aot_utils.py.New
tzrec.ops.is_ppu_arch()cached helper detects PPU by device-name substring.Test plan
release/v1.1and does NOT auto-fire for master PRs.🤖 Generated with Claude Code