[WIP] Chcai/mlperf refactor by chriscai-amd · Pull Request #887 · mlcommons/training

chriscai-amd · 2026-06-25T02:48:36Z

No description provided.

…7e51c)
Vendored snapshot of chriscai-amd/generative-recommenders branch chcai/dlrmv4 (HEAD d97e51c) as a sibling of recommendation_v2/torchrec_dlrm. The Python package generative_recommenders keeps its original name so all imports work unchanged from the new location.
- recommendation_v4/generative_recommenders/: dlrm_v3, modules, ops, research, tests
- recommendation_v4/configs/: research HSTU gins
- recommendation_v4/scripts/launch_smoke_8gpu.sh: sanitized 8-GPU yambda-5b launcher (resolves package root from script path; AMD env defaults; pip_local override)
- recommendation_v4/{setup.py,requirements.txt,main.py,...}: upstream entry points
- .gitmodules: cutlass registered at parent repo level
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Four fixes unlocking the HSTU_HAMMER_KERNEL=TRITON path on MI350X:
1. triton_hstu_attention.py _should_enable_tma(): add HIP early-out. torch.cuda.get_device_capability() on gfx950 returns (9, 5) which would pass the major==9 Hopper check and trick the kernel into the TMA path, producing kernels that don't compile on ROCm.
2. triton_hstu_attention.py _get_fw_configs(): hoist the USE_TLX/NUM_BUFFERS/NUM_MMA_WARPS_PER_GROUP/NUM_MMA_GROUPS defaults loop out of the CUDA-only else: branch. The _hstu_attn_fwd signature requires these constexprs regardless of backend; missing them on HIP triggered TypeError: dynamic_func() missing N required positional arguments at autotune. Also gate the H100 TLX configs append on `not torch.version.hip`.
3. triton_jagged_tensors.py concat/split dispatch: route AMD/ROCm through *_2D_jagged_multirow instead of the basic _concat_2D_jagged / _split_2D_jagged kernels. The basic kernels fail PassManager::run at make_ttgir (TritonAMDGPUCanonicalizePointers pass) on ROCm; multirow compiles fine. NVIDIA non-Blackwell paths (H100/A100) are unchanged.
4. triton_jagged_tensors.py _Concat2DJaggedFunction.backward: replace the raw _split_2D_jagged[grid] call with _triton_split_2D_jagged_internal so the backward pass benefits from the same AMD multirow routing as the forward.
Verified end-to-end on 8x MI350X: yambda-5b bs=32 seq=4k at 782 global_sps vs PYTORCH backend 547 sps -- 1.43x throughput, 75% peak VRAM vs 92%.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
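
The HIP early-out in fix 1 can be sketched as follows. This is a minimal illustration, not the actual `_should_enable_tma()` body: the function name `should_enable_tma` and its two parameters are assumptions made so the logic runs without a GPU, and the real check may involve more conditions than `major == 9`.

```python
def should_enable_tma(is_hip: bool, capability: tuple) -> bool:
    """Decide whether the TMA path is safe for this device (illustrative only)."""
    if is_hip:
        # gfx950 reports (9, 5); bail out before the Hopper check can match.
        return False
    major, _minor = capability
    return major == 9  # Hopper-class NVIDIA parts only, in this sketch


# In real code the inputs would come from torch, for example:
#   should_enable_tma(bool(torch.version.hip), torch.cuda.get_device_capability())
print(should_enable_tma(True, (9, 5)))   # MI350X under ROCm -> False
print(should_enable_tma(False, (9, 0)))  # H100 under CUDA   -> True
```

Passing the backend flag in explicitly keeps the decision testable on machines without ROCm or CUDA installed.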

The attribute is absent in some Triton builds (e.g. nvcr.io/nvidia/pytorch:26.01-py3), causing import-time AttributeError before any training step runs. Use getattr with a False default so _use_meta_ws() gracefully reports disabled on those builds.
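
A minimal sketch of the `getattr` pattern described above. The commit message does not name the missing attribute, so `enable_meta_ws` and the `knobs` objects below are hypothetical stand-ins; only the `getattr(..., False)` idiom is the point.

```python
from types import SimpleNamespace


def use_meta_ws(knobs) -> bool:
    # `enable_meta_ws` is a hypothetical attribute name used for illustration.
    # getattr with a default avoids AttributeError on builds that lack it.
    return bool(getattr(knobs, "enable_meta_ws", False))


old_build = SimpleNamespace()                     # attribute absent
new_build = SimpleNamespace(enable_meta_ws=True)  # attribute present

print(use_meta_ws(old_build))  # False
print(use_meta_ws(new_build))  # True
```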

Three small changes so you can sweep model size and per-sample sequence length from a gin file without editing configs.py.
configs.py:
- get_hstu_configs is now @gin.configurable. Accepts optional overrides for max_seq_len, max_num_candidates, hstu_embedding_table_dim, hstu_transducer_embedding_dim, hstu_num_heads, hstu_attn_num_layers, hstu_attn_linear_dim, hstu_attn_qk_dim, hstu_input_dropout_ratio, hstu_linear_dropout_rate. Per-dataset defaults still apply unless explicitly overridden in gin.
- get_embedding_table_config is now @gin.configurable with an embedding_dim override that uniformly sets the dim for all tables of the chosen dataset.
- Drop the YAMBDA_EMBEDDING_DIM constant (was a duplicate of HSTU_EMBEDDING_DIM=512). Yambda branch now uses HSTU_EMBEDDING_DIM directly. Add a comment noting the model+table dim must stay aligned when overriding either via gin.
utils.py:
- get_dataset accepts an optional history_length kwarg that wins over the yambda dataset's hardcoded default of 4096. Caches are still keyed on disk under hstu_cache_L<N>/ so switching L between previously built values is free.
train/gin/yambda_5b.gin:
- Pin history_length=2048 and max_seq_len=2048 for the seq-2k smoke config. Both lines have inline comments explaining the +9 overhead (uid + 7 cross + 1 candidate) so total per-sample seq is ~2046, within the 2048 budget.
Verified: default codepath unchanged, gin overrides apply consistently to both get_hstu_configs (model) and get_embedding_table_config (tables).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

make_optimizer_and_shard now accepts hbm_cap_gb (default 260, the MI350X value) via @gin.configurable. The yambda gin pins the same default so sweeps just change the number in the gin file instead of editing utils.py. ddr_cap dropped from 32 GiB to 0: with all 11 yambda 5b embedding tables fitting on 8x MI350X HBM, allowing host DRAM offload only invites the planner to pick slower per-lookup-PCIe-traffic plans. Verified gin binding flows through to the Topology: a probe with hbm_cap_gb=100 produced Topology(hbm_cap=107374182400) and the planner correctly raised insufficient-storage error at that tightness. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
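
The probe value above follows from a GiB-to-bytes conversion. A small sketch, assuming the cap is interpreted as GiB (1024^3 bytes), which is what the reported `hbm_cap=107374182400` implies; `hbm_cap_bytes` is an illustrative helper name, not a function from the repo.

```python
GIB = 1024 ** 3  # bytes per GiB


def hbm_cap_bytes(hbm_cap_gb: int) -> int:
    """Convert a per-device HBM cap in GiB to bytes."""
    return hbm_cap_gb * GIB


print(hbm_cap_bytes(100))  # 107374182400, matching the probe's Topology value
print(hbm_cap_bytes(260))  # the default cap, in bytes
```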

preprocess_public_data.py:
- Add DLRMYambdaProcessor: downloads Yambda multi_event + catalog metadata from the yandex/yambda HuggingFace repo, then runs a temporal split (300 train days / 30 min gap / 1 test day), builds per-user sessions (1800s inactivity threshold), and writes the layout DLRMv3YambdaDataset expects:
    <data-path>/raw/<size>/multi_event.parquet
    <data-path>/shared_metadata/{artist,album,embeddings}.parquet
    <data-path>/processed_<size>/{train_sessions,test_events,session_index}.parquet
    <data-path>/processed_<size>/item_popularity.npy
    <data-path>/processed_<size>/split_meta.json
- 5b variant uses chunked polars load (10M rows/chunk) to keep peak RAM under control (single-shot read of the 50 GB parquet OOMs ~150 GB systems).
- SUPPORTED_DATASETS adds yambda-50m, yambda-500m, yambda-5b.
- main() takes --data-path for custom output root.
- Verified end-to-end: 50m run completes in ~2 min, 5b in ~53 min (download dominates), output is byte-compatible with the dataset cache builder; TRITON training reaches steady state on the fresh data at 2050 sps.
utils.py:
- Add env_path(key, default) @gin.configurable helper. Used as a gin macro so any string-valued binding can be overridden by an env var without editing the gin file.
train/gin/yambda_5b.gin:
- Declare DATA_PATH = @env_path() macro with key="DLRM_DATA_PATH" and default="/apps/chcai/dlrm_data". Both new_path_prefix bindings (make_train_test_dataloaders and get_dataset) now consume %DATA_PATH. Setting DLRM_DATA_PATH=/some/path at run time redirects the dataset without a gin edit.
datasets/yambda.py:
- Strip stale references to upstream-internal preprocessing in docstrings/comments; point at preprocess_public_data.py instead.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
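
The env-override behavior of `env_path(key, default)` can be sketched without gin. This is an assumption about the helper's core logic (a plain environment lookup with a fallback); the real helper is additionally decorated with `@gin.configurable`, which is omitted here so the snippet runs standalone.

```python
import os


def env_path(key: str, default: str) -> str:
    """Return the value of env var `key` if it is set, else `default`."""
    return os.environ.get(key, default)


# Without the env var, the gin-style default wins.
os.environ.pop("DLRM_DATA_PATH", None)
default_path = env_path("DLRM_DATA_PATH", "/apps/chcai/dlrm_data")

# With the env var set, the dataset root is redirected without a config edit.
os.environ["DLRM_DATA_PATH"] = "/some/path"
override_path = env_path("DLRM_DATA_PATH", "/apps/chcai/dlrm_data")

print(default_path)
print(override_path)
```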

Every rank's first CUDA context was landing on GPU 0 (the default device), so NCCL bound its communicators there before set_device switched to the correct GPU. This leaked allocations on GPU 0 across all 8 ranks and caused spurious OOMs during embedding-table init at high HBM caps. Moving set_device above init_process_group and passing device_id ensures each rank's NCCL state is created on its own GPU.

dlrm_v3/utils.py:
- Replace the hardcoded manifold:// URL in _on_trace_ready_fn with a local trace_dir (default /tmp/dlrm_v3_traces). Filename now follows trace_step{step}_rank{rank}.json so per-rank captures don't collide.
- Add _multi_window_schedule helper: a torch.profiler schedule that fires around each step in trace_steps=[...] (warmup before, active after, RECORD_AND_SAVE at the last active step). Lets one run capture multiple windows (e.g. early-step + steady-state) without re-running.
- Make Profiler @gin.configurable. New knobs: trace_dir, trace_steps, wait, warmup, repeat, record_shapes, profile_memory, with_stack, with_flops, with_modules. Defaults preserve the prior single-window behavior (wait=10, warmup=20, active=50, repeat=1) so existing callers are unaffected.
- Add run_results_dir(run_name) gin macro: resolves to <recommendation_v4>/results/<run_name>/. Used as the canonical output prefix for traces (and any future per-run artifacts). recommendation_v4/ is bind-mounted into the training container, so files written through this helper persist on the host.
train/gin/yambda_5b.gin:
- Wire RUN_NAME env override -> run_results_dir(run_name=%RUN_NAME) -> Profiler.trace_dir. Sets trace_steps=[52], warmup=5, active=5 (capture the 5-step window 52-56 on every rank).
- Toggle train_eval_loop.output_trace = True so the profiler actually instantiates.
.gitignore:
- Add results/ alongside the existing tmp/exps/ckpts/ runtime directories so per-run trace dumps don't show up in git status.
Verified: 8x MI350X TRITON yambda-5b run at bs=32 seq=2k drops 8 well-formed trace_step62_rank{0..7}.json files (~37 MB each) into recommendation_v4/results/default/; visible on the host immediately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…mes, trim_warmup
dlrm_v3/utils.py
* Add run_results_dir(run_name) gin macro (resolves to <recommendation_v4>/results/<run_name>/) so trace artifacts persist on the host via the bind-mount.
* Add _trim_warmup_from_trace post-processor: dedupes ProfilerStep spans by name first, then keeps only the last N unique steps' worth of events. Drops WARMUP-phase events that torch.profiler otherwise includes in the chrome trace.
* Add trim_warmup kwarg (default True) on Profiler; auto-invokes the trimmer with N=active so the exported file matches the user-requested active window.
* Filename now uses trace_steps[i] (the user-requested step) as the {step} label when multi-window mode is in use, instead of torch.profiler's internal step_num (which is off by ~warmup+active from the schedule trigger and confused everyone).
train/utils.py
* Drop hardcoded `active=10` from the four `Profiler(rank, active=10)` call sites in train_loop / train_eval_loop. Positional args block gin overrides; once removed, Profiler.active in gin (default 50) and user gin bindings actually take effect.
train/gin/yambda_5b.gin
* Fix env_path scoping collision: both DATA_PATH and RUN_NAME used the unscoped @env_path() configurable, which made the second binding's `env_path.key = "RUN_NAME"` overwrite the first's `env_path.key = "DLRM_DATA_PATH"`. Both names then resolved via the same env var (whichever was last), pointing DATA_PATH at trace_run2/ and breaking dataset loads. Fixed by giving each call site its own scope: @data/env_path() and @run/env_path(), each with independent .key/.default bindings.
* Set Profiler.trace_steps=[52], warmup=1, active=5; let trim_warmup default to True so the exported trace contains exactly 5 active ProfilerStep events.
Verified end-to-end:
- Run with RUN_NAME=trace_run2 writes results/trace_run2/trace_step52_rank{0..7}.json (~19 MB each), step labels match trace_steps gin.
- Triton cache persisted across runs: cold start ~6 min -> warm start ~2 min for autotune-to-first-step.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2048 was chosen for "round number near max_seq_len" but it slightly overflows the per-sample budget: 3 * (2048//3) + 9 = 2055 > 2048, so the dataset truncates ~7 UIH events to fit. 2039 makes the math exact (3 * 679 + 9 = 2046 ≤ 2048) so no truncation. Comment block expanded to document: - The 3-pool gather semantic (L//3 events per pool, interleaved chronologically). - The like-pool under-fill observation: like events are only 1.9% of yambda corpus and max user lifetime is ~28k events, so the like pool fills to ~105 events per anchor on average (not 679). TRITON's jagged attention skips the unfilled slots, so under-fill costs sequence budget but not GPU compute. No code change. Cache for L=2039 already built and reused. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
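
The budget arithmetic above can be checked directly. A minimal sketch, assuming the per-sample length is exactly `3 * (L // 3) + 9` as the message states; `per_sample_seq_len` is an illustrative name, not a function from the repo.

```python
def per_sample_seq_len(history_length: int, num_pools: int = 3, overhead: int = 9) -> int:
    """Tokens per sample: L // 3 events from each of 3 pools, plus 9 overhead slots."""
    return num_pools * (history_length // num_pools) + overhead


MAX_SEQ_LEN = 2048
for history_length in (2048, 2039):
    total = per_sample_seq_len(history_length)
    print(history_length, total, "fits" if total <= MAX_SEQ_LEN else "overflows")
# 2048 -> 2055 (overflows by 7), 2039 -> 2046 (fits)
```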

…ol gather Documents the fork's scope (yambda-5b on HSTU dlrm_v3 path), per-pool gather strategy with effective fill table, and dataset statistics. Sections indexed 1–5 for navigation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds env_int gin macro (companion to env_path) and wires make_optimizer_and_shard.hbm_cap_gb through it so the per-rank HBM ceiling can be tuned without editing the gin file. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Document the container image, dependency versions (native NGC torch 2.10, triton 3.6, source-built fbgemm_gpu, torchrec 1.4.0, polars-u64-idx), gin training configuration, and env vars needed to reproduce the 8x B200 run.

Adds three knobs, all driven from the gin file: - make_model.bf16_training: enable bf16 autocast for the DlrmHSTU model. - env_int macro: lets numeric gin values come from env vars (used by the existing hbm_cap_gb binding). - apply_env_bootstrap.TRITON_FULL_AUTOTUNE: when False (default), three layer-norm/jagged triton kernels are pinned to a single Config so cold starts land at the same steady-state deterministically. When True, the full autotune search runs again — use this when changing shape, GPU, or triton/torch version, then re-pin from the discovered winners. train_ranker._main_func now parses gin in two phases (skip_unknown=True early, full pass after the heavy imports) so the bootstrap env var is set BEFORE the triton kernel modules evaluate their @triton.autotune decorators at module load time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirrors the B200 layout with MI350X (gfx950, ROCm 7.2.1) specifics: container image (rocm/primus:v26.3), fbgemm_gpu rebuild requirement (HEAD nightly_rocm-2026.6.1 for ~30% step-time win over the shipped 2026.5.14), the gin-driven TRITON_FULL_AUTOTUNE knob, and the measured perf ladder from fp32/PYTORCH baseline (~28 d/epoch) down to the pinned bf16/TRITON fast equilibrium (~7.6 d/epoch). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Merges per-rank chrome traces (results/<run>/trace_step{N}_rank{R}.json) into a single Perfetto-loadable file, remapping pid/flow ids so cross-rank events land on distinct tracks instead of collapsing onto one. Used to produce the bf16 + pinned-autotune step-52 trace (results/verify_rename/trace_step52.json.gz). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
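
The pid/flow-id remapping idea can be sketched as below. This is not the merge script from the PR: the function name, the `stride` offset scheme, and the handling of non-integer pids are assumptions chosen for illustration. It relies only on the Chrome trace event fields `pid`, `ph`, and `id` (flow events use `ph` values `s`, `t`, `f`).

```python
def merge_rank_traces(events_by_rank: dict, stride: int = 1_000_000) -> list:
    """Concatenate per-rank trace events, offsetting pids and flow ids by rank."""
    merged = []
    for rank, events in sorted(events_by_rank.items()):
        offset = rank * stride
        for event in events:
            event = dict(event)  # shallow copy so the input traces stay untouched
            pid = event.get("pid")
            if isinstance(pid, int):
                event["pid"] = pid + offset          # distinct track group per rank
            elif pid is not None:
                event["pid"] = f"rank{rank}:{pid}"   # string pids get a rank prefix
            if event.get("ph") in ("s", "t", "f") and isinstance(event.get("id"), int):
                event["id"] += offset                # keep flow arrows within one rank
            merged.append(event)
    return merged


rank0 = [{"ph": "X", "pid": 1, "tid": 7, "name": "k"}, {"ph": "s", "pid": 1, "id": 5}]
rank1 = [{"ph": "X", "pid": 1, "tid": 7, "name": "k"}, {"ph": "f", "pid": 1, "id": 5}]
merged = merge_rank_traces({0: rank0, 1: rank1})
print(sorted({e["pid"] for e in merged}))
```

Without the offsets, both ranks would share pid 1 and flow id 5, so their events would collapse onto one track and flow arrows would connect unrelated ranks.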

Refresh the B200 dependency versions to the latest validated stack (torch 2.12.0a0 / CUDA 13.2, fbgemm_gpu built for sm_100+CUDA 13.2, CUPTI 13.2), note 26.01 as an equivalent alternative, and record the TRITON_FULL_AUTOTUNE=True setting for B200.

…iver) Point fbgemm at the latest validated source commit (10b77573, 2026-06-01), record the tested torchrec 1.7.0.dev nightly (1.4.0 stable fallback), clarify the fbgemm wheel version string is the build date, and correct the host/forward-compat driver CUDA versions (13.0 host / 595.58.03 compat).

After upgrading to torch 2.12 / torchrec 1.7 (B200-aligned), the pinned configs from the torch 2.10 stack stopped landing on the fast equilibrium because the torchrec 1.7 code path invokes these kernels at different shape keys. Re-captured winners via a fresh autotune run and updated the pin sites:
- _weighted_layer_norm_bwd_dx: BLOCK_N 8 -> 1 (num_warps 1 unchanged)
- split_2D_jagged_multirow: BLOCK_N 1 / num_warps 2 -> BLOCK_N 8 / num_warps 1
- _layer_norm_bwd_dwdb: BLOCK_N 128, num_warps 8 (unchanged - same winner on both stacks)
Verified: 3 consecutive checkpoints (steps 151/201/251) at 52.75-53.36 ms deterministic on the new stack. Same equilibrium band as the torch 2.10 stack (51.5-53.0 ms).
Also adds a Stack B section to docs/training_recipe.md (MI350X) documenting the torch 2.12 swap recipe (torch + torchvision + torchaudio + fbgemm rebuild + torchrec git tag) so the MI350X recipe is dependency-aligned with the B200 path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Bumps the Stack B (torch 2.12 / torchrec 1.7) section to: - fbgemm commit 10b77573 (same SHA as the B200 path) instead of 1509423 (one cosmetic commit behind). Wheel rename 2026.6.1 -> 2026.6.2. - Note that Stack A and Stack B use different pinned triton configs (already merged) and explain why (torchrec 1.7 invokes the kernels at different shape keys). - Caveat: HSTU_HAMMER_KERNEL=PYTORCH fallback regresses to ~169 ms on Stack B (vs 107 ms on Stack A). TRITON is unaffected and remains the default; this only matters for PYTORCH-backend debugging. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Collapses the two-stack MI350X section into one canonical dependency table: torch 2.12 / torchrec 1.7 / fbgemm @ 10b77573 — the same SHAs as the B200 path. The image-native torch 2.10 / torchrec 1.4 / fbgemm 2026.5.14 path still works for development but the recipe doc now documents the validated production stack only. PYTORCH-backend caveat preserved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Not relevant — TRITON is the documented default backend. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ecipe
Embedding sizes match the true entity counts in yambda-5b:
    item_id    9_390_000 -> 9_390_624
    artist_id  1_290_000 -> 1_293_395
    album_id   3_370_000 -> 3_367_692
    uid        1_000_000 -> 1_000_001
This eliminates the recurring "EmbeddingBoundsCheck ... Setting idx to zero" warnings at training time.
Gin default raised to batch_size=1024 / eval_batch_size=1024. Measured steady-state on the torch 2.12 + torchrec 1.7 + fbgemm HEAD stack with TRITON HSTU + pinned triton configs: ~635 ms/step, ~12.9K sps, ~2.92 days/epoch vs ~7.6 days at bs=32. bs=2048 is feasible but only +3% throughput at much higher autotune cost, so bs=1024 is the sweet spot.
Triton autotune pin for _weighted_layer_norm_bwd_dx now ships TWO configs in the pinned list: BLOCK_N=1 (bs=32 winner) and BLOCK_N=8 (bs=1024 winner). Triton's autotune key=[BLOCK_D] dispatches the right one per shape in <5 sec on cold start (vs ~30 sec from the full pool). The other two pinned kernels (_layer_norm_bwd_dwdb, split_2D_jagged_multirow) have identical winners at bs=32 and bs=1024 so they stay single-config.
Training-recipe doc drops the batch_size rows from both MI350X and B200 config tables: the recipe is intentionally batch-size-agnostic now that the pin set covers a range.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
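
The reported throughput is consistent with a simple step-time calculation. This sketch assumes that `batch_size` is per rank and that the run uses 8 ranks; both are inferences from the 8-GPU runs described elsewhere in this PR, not statements from this commit message.

```python
def global_sps(per_rank_batch: int, world_size: int, step_ms: float) -> float:
    """Global samples per second from per-rank batch size and step latency."""
    return per_rank_batch * world_size / (step_ms / 1000.0)


sps = global_sps(per_rank_batch=1024, world_size=8, step_ms=635.0)
print(round(sps))  # roughly 12.9K, in line with the figure quoted above
```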

Enable the multi-row, separated-RNG _ln_mul_dropout path on AMD MI350 (gfx950), previously Blackwell-only. Batches rows per program and reuses a precomputed dropout mask in the backward instead of one-program-per-row fused RNG; +5.6% end-to-end (-> 14,222 global sps) at bs=1024 on yambda-5b.
- ops/utils.py: add is_amd_mi350() + use_separated_rng_ln_mul_dropout() gate.
- ops/triton/triton_hstu_linear.py: dispatch the fwd LN-dropout to the separated-RNG path via the new gate.
- ops/triton/triton_hstu_attention.py: pin fast nonkdim:16 fwd/persistent/bwd configs via pinned_or_full (TRITON_FULL_AUTOTUNE=1 still bypasses). Multi-config lists with an inline "add a new batch size" guide.
- scripts/launch_smoke_8gpu.sh: GPU clock sanity guard - log perf level + sclk, auto-restore 'auto' if a perf_determinism/manual/low lock is found (a half-clock lock uniformly slowed every Triton kernel ~1.9x and masked perf changes).
- docs/perf_opt.md: document the LN-dropout fix and the clock-lock caveat.
Co-authored-by: Cursor <cursoragent@cursor.com>
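
A sketch of how a `pinned_or_full`-style selector might behave. The two-argument signature, the accepted truthy spellings, and the placeholder config dicts are assumptions; the commit message only states that `TRITON_FULL_AUTOTUNE=1` bypasses the pinned configs.

```python
import os


def pinned_or_full(pinned: list, full: list) -> list:
    """Return the full autotune pool when TRITON_FULL_AUTOTUNE is on, else the pins."""
    flag = os.environ.get("TRITON_FULL_AUTOTUNE", "0").strip().lower()
    return full if flag in ("1", "true") else pinned


# Placeholder dicts stand in for triton.Config objects in this sketch.
PINNED = [{"BLOCK_N": 8, "num_warps": 1}]
FULL = [{"BLOCK_N": n, "num_warps": w} for n in (1, 2, 4, 8) for w in (1, 2, 4)]

os.environ["TRITON_FULL_AUTOTUNE"] = "0"
print(len(pinned_or_full(PINNED, FULL)))  # 1 config: deterministic cold start
os.environ["TRITON_FULL_AUTOTUNE"] = "1"
print(len(pinned_or_full(PINNED, FULL)))  # 12 configs: full search
```

Because `@triton.autotune` decorators are evaluated at import time, a selector like this only sees the flag if the environment variable is set before the kernel modules are imported.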

…ernel
Add an opt-in TrainPipelineSparseDist path that overlaps the embedding input-distribution all-to-all with dense fwd/bwd. To make the embedding collection pipelineable, the merged sparse KJT is now pre-built in the dataloader (Samples.merged_sparse_features) and the model consumes it via a _pipeline_mode forward that takes the batch as a single arg, so TorchRec's tracer resolves the lookup input as a plain getattr off the batch.
- dataset.py: Samples.merged_sparse_features + merge_uih_candidate_kjts, built in collate_fn; wired into to()/record_stream()/pin_memory().
- dlrm_hstu.py: _pipeline_mode flag; forward unpacks the batch and preprocess accepts the prebuilt merged KJT (falls back to building it when absent).
- utils.py: _PipelineModelWrapper, build_train_pipeline, train_eval_loop use_pipeline branch + eval batch-arg; seed all RNGs in setup() for reproducible weight init.
- gin/launch: make_model.hammer_kernel selects TRITON vs PYTORCH (env override still honored); launch script defers to the gin default. use_pipeline defaults to False.
Validated on MI350/ROCm 8-GPU: embedding collection is pipelined (input-dist a2a moves to hidden); model quality and throughput match the sequential path (seeded A/B). The exposed embedding-output a2a still dominates the step, so throughput is unchanged; pipelining is quality- and perf-neutral here.
Co-authored-by: Cursor <cursoragent@cursor.com>

Add a forward-in-time streaming path: slice the timeline into fixed-duration windows (default 1 day), train window T then eval window T+1, enforcing no future leakage (across-window + causal-history guarantees). Make it the default mode in launch_smoke_8gpu.sh. Window-reset overhead is hidden via a persistent worker pool + double buffering (next window's index mask and first-batch prefetch overlap compute on a background thread) and eval-window prefetch one window ahead, dropping train/eval first-batch waits to ~1-3ms with no steady-state regression. Window selection uses a lazily-built, mmap'd anchor-timestamp cache so the default non-streaming path is unaffected. Also harden trace export (best-effort: IO/permission failures warn instead of crashing training) now that streaming enables output_trace by default, and document the path + knobs in the README. Co-authored-by: Cursor <cursoragent@cursor.com>

save_dmp_checkpoint.path now resolves from $CKPT_PATH and defaults to empty, so checkpoints (a full DMP is ~100s of GB, and the streaming loop always saves the final window) are off unless explicitly enabled. Also drop the stale training-recipe sentence claiming native torch is kept — it contradicts the dependency table, which replaces torch and keeps only the image's triton. Co-authored-by: Cursor <cursoragent@cursor.com>

Add in-process trace postprocessing in the profiler on_trace_ready callback to fix two ROCm/roctracer rendering artifacts that make MI350X traces look wrong in Perfetto (the timing is correct, only the layout): - _normalize_profilerstep_layout: collapse the fragmented GPU-side ProfilerStep#N spans (roctracer splits a step across the HIP null + compute streams) into one full-width span per step on the busiest compute stream, matching the CUDA look. - _deoverlap_gpu_slices: pull back sub-us kernel end timestamps so back-to-back kernels don't touch/overlap; Perfetto otherwise nests the later (long) kernel inside the tiny epilogue and clips it to zero width, hiding kernels like _hstu_attn_bwd. Leaves a ~1ns gap (exact end==start is just as fatal as an overlap) and leaves real nesting untouched. Both passes are gated behind _is_rocm() (torch.version.hip) so they are complete no-ops on CUDA/B200, which don't have these artifacts. All best-effort: failures degrade to a warning and never crash training. Co-authored-by: Cursor <cursoragent@cursor.com>
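
The kernel de-overlap pass can be illustrated on a single track. This is a simplified sketch, not the `_deoverlap_gpu_slices` implementation: it only compares adjacent slices, and the `max_overlap_us` cutoff for "sub-us" overlaps is an assumed parameter. Timestamps and durations are in microseconds, as in Chrome trace `ts`/`dur` fields.

```python
def deoverlap_slices(slices: list, gap_us: float = 0.001, max_overlap_us: float = 1.0) -> list:
    """Shorten a slice that touches or barely overlaps the next sibling slice."""
    out = [dict(s) for s in sorted(slices, key=lambda s: s["ts"])]
    for cur, nxt in zip(out, out[1:]):
        cur_end = cur["ts"] + cur["dur"]
        nxt_end = nxt["ts"] + nxt["dur"]
        overlap = cur_end - nxt["ts"]
        is_sibling = nxt_end > cur_end  # a nested child would end inside `cur`
        if is_sibling and 0.0 <= overlap < max_overlap_us:
            # Pull the end back so a ~1 ns gap remains before the next slice.
            cur["dur"] = max(0.0, nxt["ts"] - gap_us - cur["ts"])
    return out


touching = [{"ts": 0.0, "dur": 10.0}, {"ts": 10.0, "dur": 100.0}]
nested = [{"ts": 0.0, "dur": 100.0}, {"ts": 10.0, "dur": 5.0}]
print(deoverlap_slices(touching)[0]["dur"])   # slightly under 10.0
print(deoverlap_slices(nested) == nested)     # True: real nesting is untouched
```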

Add _deoverlap_gpu_annotations to the trace-export postprocessing, the annotation-boundary analog of the kernel de-overlap. Kineto projects the forward/backward phase annotations (## user_forward ##, ## item_forward ##, ## stu_* ##, ...) onto the GPU stream as a chain of end-to-end siblings. The absolute step timestamps are ~5.4e12 us, where a float64's quantum is ~1 ns, so a sibling boundary that should be coincident lands a few ns off; when the earlier sibling ends at/after the next one's start, Perfetto nests and clips the next span to a sliver -- e.g. the 100+ ms ## user_forward ## vanishes on some ranks/steps purely by rounding luck. Since annotations form a real nesting hierarchy (user_forward contains the stu_* spans and their kernels), this walks the per-track slice stack and only snaps a slice back when the next slice extends beyond it (siblings, not parent/child), guarding against trimming into a span's own descendants. It also snaps kernel tails that straddle an annotation boundary. Gated by _is_rocm() (no-op on B200/CUDA) and best-effort like the other passes. Verified end-to-end on an 8-rank MI350X run: ## user_forward ## renders 40/40 (was 9/40), total clipped annotations 1352 -> ~5. Co-authored-by: Cursor <cursoragent@cursor.com>
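
The "~1 ns quantum" claim can be checked with `math.ulp` (Python 3.9+), which returns the spacing between adjacent float64 values at a given magnitude. The timestamp below is the approximate figure quoted above, not a value read from a trace.

```python
import math

ts_us = 5.4e12                  # absolute step timestamp, in microseconds
quantum_us = math.ulp(ts_us)    # spacing between adjacent float64 values here
quantum_ns = quantum_us * 1000.0

print(quantum_us)  # 0.0009765625 (2**-10 us)
print(quantum_ns)  # 0.9765625, i.e. about 1 ns
```

At this magnitude a boundary that should be coincident can only be represented to within about a nanosecond, so two computed ends that differ in their last bit land on opposite sides of the next span's start.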

Make streaming-train-eval crash-resumable and add general checkpoint cadence controls:
- Atomic checkpoint saves (.tmp dir + rename), keep_last_n pruning, and swap-aside .old overwrite so a save can safely replace an existing train_ts dir; stale .tmp/.old swept on the next save.
- Per-rank RNG snapshot/restore for bit-equal dropout replay on resume; auto-latest-subdir resolution + (train_ts, batch_idx_in_window) resume hint so a run re-enters a partial window and skips already-trained batches exact-once.
- Three independent in-window checkpoint cadences via a pure, testable decision helper: per-window batch count, monotonic global step (e.g. every 1000 steps), and wall-clock interval (e.g. hourly, rank-0-decided + broadcast to keep the save barrier in lockstep).
- gin/env bindings for all cadences + a test-only die_at_step hook.
Tests: checkpoint_cadence_test.py (cadence precedence/triggers) and an end-to-end baseline/interrupt/resume harness (streaming_resume_test.{sh,py}) that gates on functional invariants (RNG restored, correct resumed step, atomic save, keep_last_n) plus a loose trajectory-closeness bound.
Co-authored-by: Cursor <cursoragent@cursor.com>
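
The atomic-save and pruning behavior can be sketched with the standard library. This is an illustration under assumptions, not the PR's checkpoint code: `atomic_save`, `prune`, and `write_fn` are hypothetical names, checkpoint directories are assumed to be named by an integer train_ts, and the multi-rank barrier is ignored.

```python
import os
import shutil


def atomic_save(final_dir: str, write_fn) -> None:
    """Write a checkpoint into `<dir>.tmp`, then publish it with a rename."""
    tmp_dir, old_dir = final_dir + ".tmp", final_dir + ".old"
    # Sweep leftovers from an earlier save that crashed midway.
    shutil.rmtree(tmp_dir, ignore_errors=True)
    shutil.rmtree(old_dir, ignore_errors=True)
    os.makedirs(tmp_dir)
    write_fn(tmp_dir)                  # all writes land in the .tmp dir
    if os.path.isdir(final_dir):
        os.rename(final_dir, old_dir)  # swap the existing checkpoint aside
    os.rename(tmp_dir, final_dir)      # readers never see a partial directory
    shutil.rmtree(old_dir, ignore_errors=True)


def prune(root: str, keep_last_n: int) -> None:
    """Delete all but the newest `keep_last_n` integer-named checkpoint dirs."""
    if keep_last_n <= 0:
        return
    ckpts = sorted((d for d in os.listdir(root) if d.isdigit()), key=int)
    for name in ckpts[:-keep_last_n]:
        shutil.rmtree(os.path.join(root, name))
```

A crash before the final rename leaves only a `.tmp` directory, which the next save removes, so a resume never picks up a half-written checkpoint.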

…TEGY passthrough - gin: MIN_HISTORY default 0 -> 4086 (power-users floor at the full history budget; maps to the existing positions_L4086.npy cache, no rebuild/no shared-dir write). AUC_THRESHOLD left unchanged (0.80275) pending finalization. - launch_slurm.sh: forward $HISTORY_STRATEGY through the worker docker exec -e block (was silently dropped, so the knob never reached the worker); fix the stale lr-override echo (gin default is 1e-7, not 0.001). - README: document MIN_HISTORY default as 4086. Co-authored-by: Cursor <cursoragent@cursor.com>

Stop tracking the local walkthrough doc (kept on disk, no longer in the repo). Co-authored-by: Cursor <cursoragent@cursor.com>

Replace the example username in launch_slurm.sh / streaming_resume_test.sh comments with a generic <user> placeholder. Runtime defaults already derive the container name + mounts from $USER, so only doc/example strings changed. Co-authored-by: Cursor <cursoragent@cursor.com>

Document a complete sbatch launch (run-shape + data-fraction eval cadence) for 1-node and 2-node, noting the launchers differ only in --nodes. Co-authored-by: Cursor <cursoragent@cursor.com>

MetricsLogger now brackets eval and checkpoint phases with pause/resume perf timers so the reported step_ms reflects pure train-step latency. Adds wall_step_ms (inclusive), eval_ms, and ckpt_ms to the perf log line and TensorBoard scalars (appended for parser backward-compat). Checkpoint saves and eval windows are wrapped with categorized pause/resume calls. Co-authored-by: Cursor <cursoragent@cursor.com>

… consistency Co-authored-by: Cursor <cursoragent@cursor.com>

orchestrate() now defaults to the reference run-shape (START_TS=0, NUM_TRAIN_TS=299, full windows) + the data-fraction eval cadence (EVAL_EVERY_DATA_PCT=0.005, per-window off), so `sbatch scripts/launch_slurm.sh` needs no env knobs. SMOKE=1 restores the previous fast functional defaults (short window, capped batches, per-window eval). The two eval cadences are auto-deconflicted (explicit EVAL_EVERY_N_WINDOWS>0 disables data-pct). gin library defaults + the resume/local smoke paths are unchanged. README updated to the bare single/multi-node commands. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>
# Conflicts:
#	recommendation_v4/generative_recommenders/dlrm_v3/train/utils.py

…s robustness
- Dockerfiles (AMD+NVIDIA): install mlperf_logging (--no-deps) so compliance logging is not silently disabled at runtime.
- Pin mlcommons/logging to 6.0.0-rc6 in requirements.txt + both Dockerfiles for reproducibility.
- mlperf_logging_utils: guard empty os.path.dirname(log_path); attach file handler on rank 0 only; get_mlperf_logger() returns None when the dep is unavailable so callers' `is not None` guards disable logging cleanly.
- launch_slurm.sh: chmod 622 (not 666) on the job log: tee -a needs write only, avoids world-readable logs on shared NFS.
Co-authored-by: Cursor <cursoragent@cursor.com>
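
The empty-dirname guard matters because `os.path.dirname` returns `""` for a bare filename, and `os.makedirs("")` raises. A minimal sketch; `ensure_log_dir` is an illustrative name, not the helper in mlperf_logging_utils.

```python
import os


def ensure_log_dir(log_path: str) -> None:
    """Create the parent directory of `log_path` if it names one."""
    log_dir = os.path.dirname(log_path)
    if log_dir:  # "" for a bare filename such as "mlperf.log"; nothing to create
        os.makedirs(log_dir, exist_ok=True)


print(repr(os.path.dirname("mlperf.log")))      # ''
print(repr(os.path.dirname("out/mlperf.log")))  # 'out'
ensure_log_dir("mlperf.log")                    # no-op, does not raise
```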

Add decorrelate_runtime_rng(rank): after make_model + DMP init, re-seed torch/cuda with $SEED + rank so HSTU dropout draws different masks per data-parallel rank instead of the identical masks implied by the shared init seed. Runs strictly after init so replicated dense weights and sharded embeddings stay init-identical across ranks. Toggle via $DECORRELATE_DROPOUT (default 1; 0 = legacy identical-mask behavior). Offset is a pure fn of resolved $SEED + rank and per-rank RNG state is checkpointed, so reproducibility/resume are preserved. Co-authored-by: Cursor <cursoragent@cursor.com>

Remove the B200 Dockerfile.nvidia and the local scripts/run_docker.sh helper — neither is wired into launch_slurm.sh (which uses rocm/primus directly) and both are environment-specific. Gitignore them plus the local ad-hoc analysis artifacts so they are not re-added. Co-authored-by: Cursor <cursoragent@cursor.com>

…l metric - Collapse verbose comment blocks across gin/launch_slurm/resume-test to 1-2 lines; revert container name to yambda_primus and drop #SBATCH --time. - Default DECORRELATE_DROPOUT=0 (identical dropout masks across ranks). - Generalize EVAL_ACCURACY_AUC_MODE to any {window,lifetime}_{auc,gauc, accuracy,ne} metric, with NE handled as lower-is-better for early-stop. Co-authored-by: Cursor <cursoragent@cursor.com>

Collapse the large explanatory comment/docstring blocks in the MLPerf logging wiring (logger, train_ranker boundaries, streaming loop hooks, MetricsLogger counters) to keep the PR reviewable. No logic changes. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Revert recommendation_v4/.gitignore to base. Local run artifacts and ad-hoc analysis files are kept out of the repo via .git/info/exclude (local, uncommitted) instead, removing one file from the PR diff. Co-authored-by: Cursor <cursoragent@cursor.com>

Move the MLPerf event stream into mlperf_logging_utils.py: a MLPerfRunTracker state machine owns the block/eval/run markers, progress metadata, and the convergence decision (replacing ~145 lines of closures in streaming_train_eval_loop), and MLPerfLogger.log_run_start emits submission info + hyperparameters + INIT_STOP/RUN_START (collapsing the inline block in train_ranker). Convergence/EVAL_ACCURACY is fixed to per-window AUC: drop the eval_accuracy_auc_mode knob (gin + loop param + launch_slurm passthrough). Submission identity: SUBMISSION_ORG defaults to AMD, SUBMISSION_PLATFORM to MI355X (was the org name — a bug), both overridable via $MLPERF_SUBMISSION_PLATFORM. Co-authored-by: Cursor <cursoragent@cursor.com>

The lifetime cumulative AUC always uses the exact binned backend now. Remove the TRAIN_LIFETIME_AUC_MODE / EVAL_LIFETIME_AUC_MODE env overrides (and the capped-only LIFETIME_AUC_WINDOW knob, now dead) from gin and launch_slurm. Co-authored-by: Cursor <cursoragent@cursor.com>

…rride" This reverts commit 900d4f1.

Reduce the upstream launch_slurm.sh diff to just what MLPerf logging needs: SCRATCH/REPO_MOUNT/DATA_MOUNT path portability (so outputs/log land off the hardcoded /home/chcai,/apps/chcai) and the MLPerf env wiring (MLPERF_LOG_PATH, AUC_THRESHOLD, MLPERF_LOGGING, MLPERF_SUBMISSION_PLATFORM, MLPERF_TRAIN_LOSS_LOG_FREQ). Reverted the unrelated baseline changes (NCCL GDR/IFNAME defaults, SMOKE/frozen run-shape, chmod/WORKER_TEE, HISTORY_STRATEGY, lifetime-AUC passthroughs) to Chris' base. Preserve the full kitchen-sink launcher as launch_slurm_suachong.sh for personal multi-node use (self-reinvoke paths repointed to itself). Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Persist mlperf_run_started in the checkpoint so a resume relaunch continues the SAME MLPerf run instead of re-emitting INIT_START/RUN_START (compliance requires EXACTLY_ONE). Cold-vs-resume is detected from the on-disk checkpoint before setup(); the log is truncated on a cold start and appended on a resume so the single run's event stream accumulates into one file. Legacy/cold checkpoints default the flag to False. Co-authored-by: Cursor <cursoragent@cursor.com>
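The cold-versus-resume behaviour can be illustrated with a small sketch. It assumes the checkpoint is a JSON file holding a `mlperf_run_started` key and that a resume is signalled simply by the checkpoint file existing; the real checkpoint format and the function name `begin_or_resume` are assumptions made for illustration.

```python
import json
import os
import tempfile


def begin_or_resume(log_path, ckpt_path):
    """Open the event log and emit RUN_START at most once across relaunches."""
    is_resume = os.path.exists(ckpt_path)
    run_started = False
    if is_resume:
        with open(ckpt_path) as f:
            # Checkpoints written before the flag existed default to False.
            run_started = json.load(f).get("mlperf_run_started", False)
    # Cold start truncates the log; a resume appends to the same file.
    with open(log_path, "a" if is_resume else "w") as log:
        if not run_started:
            log.write("RUN_START\n")
            run_started = True
        log.write("BLOCK_START\n")
    # Persist the flag so the next relaunch sees it (hypothetical format).
    with open(ckpt_path, "w") as f:
        json.dump({"mlperf_run_started": run_started}, f)
    return is_resume


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "mlperf.log")
    ckpt_path = os.path.join(tmp, "ckpt.json")
    first = begin_or_resume(log_path, ckpt_path)   # cold start
    second = begin_or_resume(log_path, ckpt_path)  # resume relaunch
    with open(log_path) as f:
        lines = f.read().splitlines()
```

After both calls the log holds a single RUN_START followed by two BLOCK_START lines, which is the "exactly one" property the commit is protecting.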

Default SEED back to 1 for a fixed, reproducible weight init out of the box ($SEED=-1 still draws a fresh random seed per run). Default AUC_THRESHOLD to 1.0 (unreachable) in both the gin binding and the launch_slurm.sh fallback so a streaming-train-eval run trains through all windows by default instead of early-stopping; set $AUC_THRESHOLD=0.80275 for the MLPerf convergence target. Co-authored-by: Cursor <cursoragent@cursor.com>
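As a rough illustration of these defaults, the sketch below resolves `SEED` and `AUC_THRESHOLD` from an environment mapping. The helper names are hypothetical, and how the PR actually draws the random seed for `SEED=-1` is not shown in the commit, so the `SystemRandom` call is an assumption.

```python
import os
import random


def resolve_seed(env=os.environ):
    """SEED unset -> 1; SEED=-1 -> fresh random seed; otherwise the given value."""
    seed = int(env.get("SEED", "1"))
    if seed == -1:
        # Assumed mechanism: draw a non-negative 31-bit seed from the OS.
        seed = random.SystemRandom().randrange(2**31)
    return seed


def resolve_auc_threshold(env=os.environ):
    """AUC_THRESHOLD unset -> 1.0, which the commit treats as unreachable."""
    return float(env.get("AUC_THRESHOLD", "1.0"))


default_seed = resolve_seed({})
fresh_seed = resolve_seed({"SEED": "-1"})
default_threshold = resolve_auc_threshold({})
mlperf_threshold = resolve_auc_threshold({"AUC_THRESHOLD": "0.80275"})
```

With nothing set, the run uses seed 1 and a threshold of 1.0, so it trains through every window; exporting `AUC_THRESHOLD=0.80275` restores the MLPerf convergence target.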

Restore NCCL_NET_GDR_LEVEL=5 + NCCL_DMABUF_ENABLE=1 defaults so RCCL does true GPU<->NIC DMA over bnxt_re instead of host-memory staging (~+22% throughput at 2 nodes; 65.7%->79.8% weak-scaling efficiency). The brcmrdma host kernel ships the inbox peer-memory client, so GDR works with no container/host changes; non-fatal fallback to host staging if peermem is absent. Override with NCCL_NET_GDR_LEVEL=0. Co-authored-by: Cursor <cursoragent@cursor.com>
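A Python analogue of the "default unless overridden" behaviour, plus a check of the quoted numbers. The launcher itself is a shell script, so `apply_gdr_defaults` is a hypothetical helper; the arithmetic assumes the single-node baseline is identical in both measurements, so the ratio of weak-scaling efficiencies equals the ratio of 2-node throughputs.

```python
GDR_DEFAULTS = {"NCCL_NET_GDR_LEVEL": "5", "NCCL_DMABUF_ENABLE": "1"}


def apply_gdr_defaults(env):
    """Fill in the GDR defaults without clobbering values the caller set."""
    for name, value in GDR_DEFAULTS.items():
        env.setdefault(name, value)
    return env


bare = apply_gdr_defaults({})
opted_out = apply_gdr_defaults({"NCCL_NET_GDR_LEVEL": "0"})

# Weak-scaling efficiency at 2 nodes, before and after, from the commit message.
eff_before, eff_after = 0.657, 0.798
relative_gain = eff_after / eff_before - 1.0
print(f"relative throughput gain: {relative_gain:.1%}")
```

The computed gain is about 21.5%, in line with the "~+22%" quoted in the commit message.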

The slimmed launch_slurm.sh has smoke-shaped run defaults (START_TS=150, NUM_TRAIN_TS=1, NUM_TRAIN_BATCHES=20, per-window eval) and no SMOKE=1 toggle, so a bare submit is a short functional run — not the 299-window reference. Document the bare submit as the smoke run and give the explicit env-override command for the full reference sweep; drop the unimplemented SMOKE=1 instructions. Co-authored-by: Cursor <cursoragent@cursor.com>

dlrmv4: portable multi-node baseline + MLPerf compliance logging

Bring the HSTU/yambda-5b benchmark in line with MLPerf Training reference conventions:
- add download_dataset.sh / verify_dataset.sh / run_and_time.sh wrappers
- add md5sums checksum file (placeholder hashes) for dataset verification
- restructure README.MD to the MLPerf spec (summaries, model+paper, hyperparameter table with tuning rules, quality target, eval frequency, steps-to-run)
- freeze requirements.txt to the exact Dockerfile/training_recipe versions
- add blank RCP placeholder (rcp/README.md) to be filled once convergence runs are generated

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-06-25T02:48:45Z

MLCommons CLA bot:
Thank you very much for your submission; we really appreciate it. Before we can accept your contribution,
we ask that you sign the MLCommons CLA (Apache 2). Please submit your GitHub ID to our onboarding form to initiate
authorization. If you are from a MLCommons member organization, we will request that you be added to the CLA.
If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact
support@mlcommons.org.
2 out of 4 committers have signed the MLCommons CLA.
✅ @chriscai-amd
✅ @suachong
❌ Su Ann Chong
❌ @nehaprakriya
Su Ann Chong does not seem to be a GitHub user. You need a GitHub account after you become an MLCommons member. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request.

chriscai-amd and others added 30 commits May 29, 2026 16:21

gin: make hbm_cap_gb overridable via $HBM_CAP_GB

1229f59

Adds env_int gin macro (companion to env_path) and wires make_optimizer_and_shard.hbm_cap_gb through it so the per-rank HBM ceiling can be tuned without editing the gin file. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
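A sketch of what an integer-from-environment helper of this kind might look like, assuming it falls back to a default when the variable is unset or empty. The signature and the fallback value of 64 are illustrative guesses; the PR's actual `env_int` and its gin registration are not shown here.

```python
import os


def env_int(name, default):
    """Return int($name) if the variable is set and non-empty, else default."""
    raw = os.environ.get(name, "")
    return int(raw) if raw.strip() else default


os.environ.pop("HBM_CAP_GB", None)
unset_value = env_int("HBM_CAP_GB", 64)   # hypothetical default of 64

os.environ["HBM_CAP_GB"] = "180"
override_value = env_int("HBM_CAP_GB", 64)
os.environ.pop("HBM_CAP_GB", None)
```

Routing `hbm_cap_gb` through such a helper lets the per-rank ceiling change per launch (for example via `export HBM_CAP_GB=...`) while the gin file stays untouched.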

docs: add B200 training recipe for yambda-5b

4de66a4

Document the container image, dependency versions (native NGC torch 2.10, triton 3.6, source-built fbgemm_gpu, torchrec 1.4.0, polars-u64-idx), gin training configuration, and env vars needed to reproduce the 8x B200 run.

docs: drop PYTORCH-fallback caveat from MI350X recipe

04d0142

Not relevant — TRITON is the documented default backend. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

suachong and others added 27 commits June 24, 2026 00:49

dlrmv4: untrack docs/v4_vs_v2_and_hstu_walkthrough.md

e4371b2

Stop tracking the local walkthrough doc (kept on disk, no longer in the repo). Co-authored-by: Cursor <cursoragent@cursor.com>

dlrmv4: README — add full single/multi-node reference run example

12740df

Document a complete sbatch launch (run-shape + data-fraction eval cadence) for 1-node and 2-node, noting the launchers differ only in --nodes. Co-authored-by: Cursor <cursoragent@cursor.com>

dlrmv4: README — use AUC_THRESHOLD=0.80275 in example for gin-default…

776f263

… consistency Co-authored-by: Cursor <cursoragent@cursor.com>

Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4

2174392

Co-authored-by: Cursor <cursoragent@cursor.com>
# Conflicts:
# recommendation_v4/generative_recommenders/dlrm_v3/train/utils.py

dlrmv4: trim seed_everything / decorrelate_runtime_rng docstrings

af596c8

Co-authored-by: Cursor <cursoragent@cursor.com>

Revert "dlrmv4: hardcode lifetime-AUC backend to binned, drop the ove…

a584795

…rride" This reverts commit 900d4f1.

dlrmv4: drop launch_slurm_suachong.sh from the PR (keep local only)

97b7057

Co-authored-by: Cursor <cursoragent@cursor.com>

dlrmv4: revert streaming_resume_test.sh to base (out of MLPerf PR scope)

488cacb

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge pull request #1 from chriscai-amd/suachong/dlrmv4

1632206

dlrmv4: portable multi-node baseline + MLPerf compliance logging

chriscai-amd closed this Jun 25, 2026

github-actions Bot locked and limited conversation to collaborators Jun 25, 2026

[WIP] Chcai/mlperf refactor #887
chriscai-amd wants to merge 107 commits into mlcommons:master from chriscai-amd:chcai/mlperf_refactor
