[WIP] Chcai/mlperf refactor#887
Conversation
…7e51c)
Vendored snapshot of chriscai-amd/generative-recommenders branch chcai/dlrmv4
(HEAD d97e51c) as a sibling of recommendation_v2/torchrec_dlrm. The Python
package generative_recommenders keeps its original name so all imports work
unchanged from the new location.
- recommendation_v4/generative_recommenders/: dlrm_v3, modules, ops, research, tests
- recommendation_v4/configs/: research HSTU gins
- recommendation_v4/scripts/launch_smoke_8gpu.sh: sanitized 8-GPU yambda-5b launcher
(resolves package root from script path; AMD env defaults; pip_local override)
- recommendation_v4/{setup.py,requirements.txt,main.py,...}: upstream entry points
- .gitmodules: cutlass registered at parent repo level
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Four fixes unlocking the HSTU_HAMMER_KERNEL=TRITON path on MI350X: 1. triton_hstu_attention.py _should_enable_tma(): add HIP early-out. torch.cuda.get_device_capability() on gfx950 returns (9, 5) which would pass the major==9 Hopper check and trick the kernel into the TMA path, producing kernels that don't compile on ROCm. 2. triton_hstu_attention.py _get_fw_configs(): hoist the USE_TLX/NUM_BUFFERS/ NUM_MMA_WARPS_PER_GROUP/NUM_MMA_GROUPS defaults loop out of the CUDA-only else: branch. The _hstu_attn_fwd signature requires these constexprs regardless of backend; missing them on HIP triggered TypeError: dynamic_func() missing N required positional arguments at autotune. Also gate the H100 TLX configs append on `not torch.version.hip`. 3. triton_jagged_tensors.py concat/split dispatch: route AMD/ROCm through *_2D_jagged_multirow instead of the basic _concat_2D_jagged / _split_2D_jagged kernels. The basic kernels fail PassManager::run at make_ttgir (TritonAMDGPUCanonicalizePointers pass) on ROCm; multirow compiles fine. NVIDIA non-Blackwell paths (H100/A100) are unchanged. 4. triton_jagged_tensors.py _Concat2DJaggedFunction.backward: replace the raw _split_2D_jagged[grid] call with _triton_split_2D_jagged_internal so the backward pass benefits from the same AMD multirow routing as the forward. Verified end-to-end on 8x MI350X: yambda-5b bs=32 seq=4k at 782 global_sps vs PYTORCH backend 547 sps -- 1.43x throughput, 75% peak VRAM vs 92%. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The attribute is absent in some Triton builds (e.g. nvcr.io/nvidia/pytorch:26.01-py3), causing import-time AttributeError before any training step runs. Use getattr with a False default so _use_meta_ws() gracefully reports disabled on those builds.
Three small changes so you can sweep model size and per-sample sequence
length from a gin file without editing configs.py.
configs.py:
- get_hstu_configs is now @gin.configurable. Accepts optional overrides
for max_seq_len, max_num_candidates, hstu_embedding_table_dim,
hstu_transducer_embedding_dim, hstu_num_heads, hstu_attn_num_layers,
hstu_attn_linear_dim, hstu_attn_qk_dim, hstu_input_dropout_ratio,
hstu_linear_dropout_rate. Per-dataset defaults still apply unless
explicitly overridden in gin.
- get_embedding_table_config is now @gin.configurable with an
embedding_dim override that uniformly sets the dim for all tables
of the chosen dataset.
- Drop the YAMBDA_EMBEDDING_DIM constant (was a duplicate of
HSTU_EMBEDDING_DIM=512). Yambda branch now uses HSTU_EMBEDDING_DIM
directly. Add a comment noting the model+table dim must stay aligned
when overriding either via gin.
utils.py:
- get_dataset accepts an optional history_length kwarg that wins over
the yambda dataset's hardcoded default of 4096. Caches are still
keyed on disk under hstu_cache_L<N>/ so switching L between previously
built values is free.
train/gin/yambda_5b.gin:
- Pin history_length=2048 and max_seq_len=2048 for the seq-2k smoke
config. Both lines have inline comments explaining the +9 overhead
(uid + 7 cross + 1 candidate) so total per-sample seq is ~2046,
within the 2048 budget.
Verified: default codepath unchanged, gin overrides apply consistently
to both get_hstu_configs (model) and get_embedding_table_config (tables).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
make_optimizer_and_shard now accepts hbm_cap_gb (default 260, the MI350X value) via @gin.configurable. The yambda gin pins the same default so sweeps just change the number in the gin file instead of editing utils.py. ddr_cap dropped from 32 GiB to 0: with all 11 yambda 5b embedding tables fitting on 8x MI350X HBM, allowing host DRAM offload only invites the planner to pick slower per-lookup-PCIe-traffic plans. Verified gin binding flows through to the Topology: a probe with hbm_cap_gb=100 produced Topology(hbm_cap=107374182400) and the planner correctly raised insufficient-storage error at that tightness. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
preprocess_public_data.py:
- Add DLRMYambdaProcessor: downloads Yambda multi_event + catalog
metadata from the yandex/yambda HuggingFace repo, then runs a
temporal split (300 train days / 30 min gap / 1 test day),
builds per-user sessions (1800s inactivity threshold), and
writes the layout DLRMv3YambdaDataset expects:
<data-path>/raw/<size>/multi_event.parquet
<data-path>/shared_metadata/{artist,album,embeddings}.parquet
<data-path>/processed_<size>/{train_sessions,test_events,
session_index}.parquet
<data-path>/processed_<size>/item_popularity.npy
<data-path>/processed_<size>/split_meta.json
- 5b variant uses chunked polars load (10M rows/chunk) to keep
peak RAM under control (single-shot read of the 50 GB parquet
OOMs ~150 GB systems).
- SUPPORTED_DATASETS adds yambda-50m, yambda-500m, yambda-5b.
- main() takes --data-path for custom output root.
- Verified end-to-end: 50m run completes in ~2 min, 5b in ~53 min
(download dominates), output is byte-compatible with the dataset
cache builder; TRITON training reaches steady state on the
fresh data at 2050 sps.
utils.py:
- Add env_path(key, default) @gin.configurable helper. Used as a
gin macro so any string-valued binding can be overridden by an
env var without editing the gin file.
train/gin/yambda_5b.gin:
- Declare DATA_PATH = @env_path() macro with key="DLRM_DATA_PATH"
and default="/apps/chcai/dlrm_data". Both new_path_prefix
bindings (make_train_test_dataloaders and get_dataset) now
consume %DATA_PATH. Setting DLRM_DATA_PATH=/some/path at run
time redirects the dataset without a gin edit.
datasets/yambda.py:
- Strip stale references to upstream-internal preprocessing in
docstrings/comments; point at preprocess_public_data.py instead.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every rank's first CUDA context was landing on GPU 0 (the default device), so NCCL bound its communicators there before set_device switched to the correct GPU. This leaked allocations on GPU 0 across all 8 ranks and caused spurious OOMs during embedding-table init at high HBM caps. Moving set_device above init_process_group and passing device_id ensures each rank's NCCL state is created on its own GPU.
dlrm_v3/utils.py:
- Replace the hardcoded manifold:// URL in _on_trace_ready_fn with a
local trace_dir (default /tmp/dlrm_v3_traces). Filename now follows
trace_step{step}_rank{rank}.json so per-rank captures don't collide.
- Add _multi_window_schedule helper: a torch.profiler schedule that
fires around each step in trace_steps=[...] (warmup before, active
after, RECORD_AND_SAVE at the last active step). Lets one run
capture multiple windows (e.g. early-step + steady-state) without
re-running.
- Make Profiler @gin.configurable. New knobs: trace_dir, trace_steps,
wait, warmup, repeat, record_shapes, profile_memory, with_stack,
with_flops, with_modules. Defaults preserve the prior single-window
behavior (wait=10, warmup=20, active=50, repeat=1) so existing
callers are unaffected.
- Add run_results_dir(run_name) gin macro: resolves to
<recommendation_v4>/results/<run_name>/. Used as the canonical
output prefix for traces (and any future per-run artifacts).
recommendation_v4/ is bind-mounted into the training container, so
files written through this helper persist on the host.
train/gin/yambda_5b.gin:
- Wire RUN_NAME env override -> run_results_dir(run_name=%RUN_NAME)
-> Profiler.trace_dir. Sets trace_steps=[52], warmup=5, active=5
(capture the 5-step window 52-56 on every rank).
- Toggle train_eval_loop.output_trace = True so the profiler actually
instantiates.
.gitignore:
- Add results/ alongside the existing tmp/exps/ckpts/ runtime
directories so per-run trace dumps don't show up in git status.
Verified: 8x MI350X TRITON yambda-5b run at bs=32 seq=2k drops
8 well-formed trace_step62_rank{0..7}.json files (~37 MB each) into
recommendation_v4/results/default/; visible on the host immediately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…mes, trim_warmup
dlrm_v3/utils.py
* Add run_results_dir(run_name) gin macro (resolves to
<recommendation_v4>/results/<run_name>/) so trace artifacts persist on
the host via the bind-mount.
* Add _trim_warmup_from_trace post-processor: dedupes ProfilerStep spans
by name first, then keeps only the last N unique steps' worth of
events. Drops WARMUP-phase events that torch.profiler otherwise
includes in the chrome trace.
* Add trim_warmup kwarg (default True) on Profiler; auto-invokes the
trimmer with N=active so the exported file matches the user-requested
active window.
* Filename now uses trace_steps[i] (the user-requested step) as the
{step} label when multi-window mode is in use, instead of
torch.profiler's internal step_num (which is off by ~warmup+active
from the schedule trigger and confused everyone).
train/utils.py
* Drop hardcoded `active=10` from the four `Profiler(rank, active=10)`
call sites in train_loop / train_eval_loop. Positional args block
gin overrides; once removed, Profiler.active in gin (default 50) and
user gin bindings actually take effect.
train/gin/yambda_5b.gin
* Fix env_path scoping collision: both DATA_PATH and RUN_NAME used the
unscoped @env_path() configurable, which made the second binding's
`env_path.key = "RUN_NAME"` overwrite the first's
`env_path.key = "DLRM_DATA_PATH"`. Both names then resolved via the
same env var (whichever was last), pointing DATA_PATH at trace_run2/
and breaking dataset loads.
Fixed by giving each call site its own scope: @data/env_path() and
@run/env_path(), each with independent .key/.default bindings.
* Set Profiler.trace_steps=[52], warmup=1, active=5; let trim_warmup
default to True so the exported trace contains exactly 5 active
ProfilerStep events.
Verified end-to-end:
- Run with RUN_NAME=trace_run2 writes results/trace_run2/trace_step52_
rank{0..7}.json (~19 MB each), step labels match trace_steps gin.
- Triton cache persisted across runs: cold start ~6 min -> warm start
~2 min for autotune-to-first-step.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2048 was chosen for "round number near max_seq_len" but it slightly
overflows the per-sample budget: 3 * (2048//3) + 9 = 2055 > 2048, so
the dataset truncates ~7 UIH events to fit. 2039 makes the math exact
(3 * 679 + 9 = 2046 ≤ 2048) so no truncation.
Comment block expanded to document:
- The 3-pool gather semantic (L//3 events per pool, interleaved
chronologically).
- The like-pool under-fill observation: like events are only 1.9%
of yambda corpus and max user lifetime is ~28k events, so the
like pool fills to ~105 events per anchor on average (not 679).
TRITON's jagged attention skips the unfilled slots, so under-fill
costs sequence budget but not GPU compute.
No code change. Cache for L=2039 already built and reused.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ol gather Documents the fork's scope (yambda-5b on HSTU dlrm_v3 path), per-pool gather strategy with effective fill table, and dataset statistics. Sections indexed 1–5 for navigation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds env_int gin macro (companion to env_path) and wires make_optimizer_and_shard.hbm_cap_gb through it so the per-rank HBM ceiling can be tuned without editing the gin file. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Document the container image, dependency versions (native NGC torch 2.10, triton 3.6, source-built fbgemm_gpu, torchrec 1.4.0, polars-u64-idx), gin training configuration, and env vars needed to reproduce the 8x B200 run.
Adds three knobs, all driven from the gin file: - make_model.bf16_training: enable bf16 autocast for the DlrmHSTU model. - env_int macro: lets numeric gin values come from env vars (used by the existing hbm_cap_gb binding). - apply_env_bootstrap.TRITON_FULL_AUTOTUNE: when False (default), three layer-norm/jagged triton kernels are pinned to a single Config so cold starts land at the same steady-state deterministically. When True, the full autotune search runs again — use this when changing shape, GPU, or triton/torch version, then re-pin from the discovered winners. train_ranker._main_func now parses gin in two phases (skip_unknown=True early, full pass after the heavy imports) so the bootstrap env var is set BEFORE the triton kernel modules evaluate their @triton.autotune decorators at module load time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the B200 layout with MI350X (gfx950, ROCm 7.2.1) specifics: container image (rocm/primus:v26.3), fbgemm_gpu rebuild requirement (HEAD nightly_rocm-2026.6.1 for ~30% step-time win over the shipped 2026.5.14), the gin-driven TRITON_FULL_AUTOTUNE knob, and the measured perf ladder from fp32/PYTORCH baseline (~28 d/epoch) down to the pinned bf16/TRITON fast equilibrium (~7.6 d/epoch). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Merges per-rank chrome traces (results/<run>/trace_step{N}_rank{R}.json)
into a single Perfetto-loadable file, remapping pid/flow ids so
cross-rank events land on distinct tracks instead of collapsing onto one.
Used to produce the bf16 + pinned-autotune step-52 trace
(results/verify_rename/trace_step52.json.gz).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refresh the B200 dependency versions to the latest validated stack (torch 2.12.0a0 / CUDA 13.2, fbgemm_gpu built for sm_100+CUDA 13.2, CUPTI 13.2), note 26.01 as an equivalent alternative, and record the TRITON_FULL_AUTOTUNE=True setting for B200.
…iver) Point fbgemm at the latest validated source commit (10b77573, 2026-06-01), record the tested torchrec 1.7.0.dev nightly (1.4.0 stable fallback), clarify the fbgemm wheel version string is the build date, and correct the host/forward-compat driver CUDA versions (13.0 host / 595.58.03 compat).
After upgrading to torch 2.12 / torchrec 1.7 (B200-aligned), the pinned configs from the torch 2.10 stack stopped landing on the fast equilibrium because the torchrec 1.7 code path invokes these kernels at different shape keys. Re-captured winners via a fresh autotune run and updated the pin sites: - _weighted_layer_norm_bwd_dx: BLOCK_N 8 -> 1 (num_warps 1 unchanged) - split_2D_jagged_multirow: BLOCK_N 1 / num_warps 2 -> BLOCK_N 8 / num_warps 1 - _layer_norm_bwd_dwdb: BLOCK_N 128, num_warps 8 (unchanged - same winner on both stacks) Verified: 3 consecutive checkpoints (steps 151/201/251) at 52.75-53.36 ms deterministic on the new stack. Same equilibrium band as the torch 2.10 stack (51.5-53.0 ms). Also adds a Stack B section to docs/training_recipe.md (MI350X) documenting the torch 2.12 swap recipe (torch + torchvision + torchaudio + fbgemm rebuild + torchrec git tag) so the MI350X recipe is dependency-aligned with the B200 path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bumps the Stack B (torch 2.12 / torchrec 1.7) section to: - fbgemm commit 10b77573 (same SHA as the B200 path) instead of 1509423 (one cosmetic commit behind). Wheel rename 2026.6.1 -> 2026.6.2. - Note that Stack A and Stack B use different pinned triton configs (already merged) and explain why (torchrec 1.7 invokes the kernels at different shape keys). - Caveat: HSTU_HAMMER_KERNEL=PYTORCH fallback regresses to ~169 ms on Stack B (vs 107 ms on Stack A). TRITON is unaffected and remains the default; this only matters for PYTORCH-backend debugging. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collapses the two-stack MI350X section into one canonical dependency table: torch 2.12 / torchrec 1.7 / fbgemm @ 10b77573 — the same SHAs as the B200 path. The image-native torch 2.10 / torchrec 1.4 / fbgemm 2026.5.14 path still works for development but the recipe doc now documents the validated production stack only. PYTORCH-backend caveat preserved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Not relevant — TRITON is the documented default backend. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ecipe Embedding sizes match the true entity counts in yambda-5b: item_id 9_390_000 -> 9_390_624 artist_id 1_290_000 -> 1_293_395 album_id 3_370_000 -> 3_367_692 uid 1_000_000 -> 1_000_001 This eliminates the recurring "EmbeddingBoundsCheck ... Setting idx to zero" warnings at training time. Gin default raised to batch_size=1024 / eval_batch_size=1024. Measured steady-state on the torch 2.12 + torchrec 1.7 + fbgemm HEAD stack with TRITON HSTU + pinned triton configs: ~635 ms/step, ~12.9K sps, ~2.92 days/epoch vs ~7.6 days at bs=32. bs=2048 is feasible but only +3% throughput at much higher autotune cost, so bs=1024 is the sweet spot. Triton autotune pin for _weighted_layer_norm_bwd_dx now ships TWO configs in the pinned list — BLOCK_N=1 (bs=32 winner) and BLOCK_N=8 (bs=1024 winner). Triton's autotune key=[BLOCK_D] dispatches the right one per shape in <5 sec on cold start (vs ~30 sec from the full pool). The other two pinned kernels (_layer_norm_bwd_dwdb, split_2D_jagged_multirow) have identical winners at bs=32 and bs=1024 so they stay single-config. Training-recipe doc drops the batch_size rows from both MI350X and B200 config tables — the recipe is intentionally batch-size-agnostic now that the pin set covers a range. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enable the multi-row, separated-RNG _ln_mul_dropout path on AMD MI350 (gfx950), previously Blackwell-only. Batches rows per program and reuses a precomputed dropout mask in the backward instead of one-program-per-row fused RNG; +5.6% end-to-end (-> 14,222 global sps) at bs=1024 on yambda-5b. - ops/utils.py: add is_amd_mi350() + use_separated_rng_ln_mul_dropout() gate. - ops/triton/triton_hstu_linear.py: dispatch the fwd LN-dropout to the separated-RNG path via the new gate. - ops/triton/triton_hstu_attention.py: pin fast nonkdim:16 fwd/persistent/bwd configs via pinned_or_full (TRITON_FULL_AUTOTUNE=1 still bypasses). Multi-config lists with an inline "add a new batch size" guide. - scripts/launch_smoke_8gpu.sh: GPU clock sanity guard - log perf level + sclk, auto-restore 'auto' if a perf_determinism/manual/low lock is found (a half-clock lock uniformly slowed every Triton kernel ~1.9x and masked perf changes). - docs/perf_opt.md: document the LN-dropout fix and the clock-lock caveat. Co-authored-by: Cursor <cursoragent@cursor.com>
…ernel Add an opt-in TrainPipelineSparseDist path that overlaps the embedding input-distribution all-to-all with dense fwd/bwd. To make the embedding collection pipelineable, the merged sparse KJT is now pre-built in the dataloader (Samples.merged_sparse_features) and the model consumes it via a _pipeline_mode forward that takes the batch as a single arg, so TorchRec's tracer resolves the lookup input as a plain getattr off the batch. - dataset.py: Samples.merged_sparse_features + merge_uih_candidate_kjts, built in collate_fn; wired into to()/record_stream()/pin_memory(). - dlrm_hstu.py: _pipeline_mode flag; forward unpacks the batch and preprocess accepts the prebuilt merged KJT (falls back to building it when absent). - utils.py: _PipelineModelWrapper, build_train_pipeline, train_eval_loop use_pipeline branch + eval batch-arg; seed all RNGs in setup() for reproducible weight init. - gin/launch: make_model.hammer_kernel selects TRITON vs PYTORCH (env override still honored); launch script defers to the gin default. use_pipeline defaults to False. Validated on MI350/ROCm 8-GPU: embedding collection is pipelined (input-dist a2a moves to hidden); model quality and throughput match the sequential path (seeded A/B). The exposed embedding-output a2a still dominates the step, so throughput is unchanged — pipelining is quality- and perf-neutral here. Co-authored-by: Cursor <cursoragent@cursor.com>
Add a forward-in-time streaming path: slice the timeline into fixed-duration windows (default 1 day), train window T then eval window T+1, enforcing no future leakage (across-window + causal-history guarantees). Make it the default mode in launch_smoke_8gpu.sh. Window-reset overhead is hidden via a persistent worker pool + double buffering (next window's index mask and first-batch prefetch overlap compute on a background thread) and eval-window prefetch one window ahead, dropping train/eval first-batch waits to ~1-3ms with no steady-state regression. Window selection uses a lazily-built, mmap'd anchor-timestamp cache so the default non-streaming path is unaffected. Also harden trace export (best-effort: IO/permission failures warn instead of crashing training) now that streaming enables output_trace by default, and document the path + knobs in the README. Co-authored-by: Cursor <cursoragent@cursor.com>
save_dmp_checkpoint.path now resolves from $CKPT_PATH and defaults to empty, so checkpoints (a full DMP is ~100s of GB, and the streaming loop always saves the final window) are off unless explicitly enabled. Also drop the stale training-recipe sentence claiming native torch is kept — it contradicts the dependency table, which replaces torch and keeps only the image's triton. Co-authored-by: Cursor <cursoragent@cursor.com>
Add in-process trace postprocessing in the profiler on_trace_ready callback to fix two ROCm/roctracer rendering artifacts that make MI350X traces look wrong in Perfetto (the timing is correct, only the layout): - _normalize_profilerstep_layout: collapse the fragmented GPU-side ProfilerStep#N spans (roctracer splits a step across the HIP null + compute streams) into one full-width span per step on the busiest compute stream, matching the CUDA look. - _deoverlap_gpu_slices: pull back sub-us kernel end timestamps so back-to-back kernels don't touch/overlap; Perfetto otherwise nests the later (long) kernel inside the tiny epilogue and clips it to zero width, hiding kernels like _hstu_attn_bwd. Leaves a ~1ns gap (exact end==start is just as fatal as an overlap) and leaves real nesting untouched. Both passes are gated behind _is_rocm() (torch.version.hip) so they are complete no-ops on CUDA/B200, which don't have these artifacts. All best-effort: failures degrade to a warning and never crash training. Co-authored-by: Cursor <cursoragent@cursor.com>
Add _deoverlap_gpu_annotations to the trace-export postprocessing, the annotation-boundary analog of the kernel de-overlap. Kineto projects the forward/backward phase annotations (## user_forward ##, ## item_forward ##, ## stu_* ##, ...) onto the GPU stream as a chain of end-to-end siblings. The absolute step timestamps are ~5.4e12 us, where a float64's quantum is ~1 ns, so a sibling boundary that should be coincident lands a few ns off; when the earlier sibling ends at/after the next one's start, Perfetto nests and clips the next span to a sliver -- e.g. the 100+ ms ## user_forward ## vanishes on some ranks/steps purely by rounding luck. Since annotations form a real nesting hierarchy (user_forward contains the stu_* spans and their kernels), this walks the per-track slice stack and only snaps a slice back when the next slice extends beyond it (siblings, not parent/child), guarding against trimming into a span's own descendants. It also snaps kernel tails that straddle an annotation boundary. Gated by _is_rocm() (no-op on B200/CUDA) and best-effort like the other passes. Verified end-to-end on an 8-rank MI350X run: ## user_forward ## renders 40/40 (was 9/40), total clipped annotations 1352 -> ~5. Co-authored-by: Cursor <cursoragent@cursor.com>
Make streaming-train-eval crash-resumable and add general checkpoint
cadence controls:
- Atomic checkpoint saves (.tmp dir + rename), keep_last_n pruning, and
swap-aside .old overwrite so a save can safely replace an existing
train_ts dir; stale .tmp/.old swept on the next save.
- Per-rank RNG snapshot/restore for bit-equal dropout replay on resume;
auto-latest-subdir resolution + (train_ts, batch_idx_in_window) resume
hint so a run re-enters a partial window and skips already-trained
batches exact-once.
- Three independent in-window checkpoint cadences via a pure, testable
decision helper: per-window batch count, monotonic global step
(e.g. every 1000 steps), and wall-clock interval (e.g. hourly,
rank-0-decided + broadcast to keep the save barrier in lockstep).
- gin/env bindings for all cadences + a test-only die_at_step hook.
Tests: checkpoint_cadence_test.py (cadence precedence/triggers) and an
end-to-end baseline/interrupt/resume harness (streaming_resume_test.{sh,py})
that gates on functional invariants (RNG restored, correct resumed step,
atomic save, keep_last_n) plus a loose trajectory-closeness bound.
Co-authored-by: Cursor <cursoragent@cursor.com>
…TEGY passthrough - gin: MIN_HISTORY default 0 -> 4086 (power-users floor at the full history budget; maps to the existing positions_L4086.npy cache, no rebuild/no shared-dir write). AUC_THRESHOLD left unchanged (0.80275) pending finalization. - launch_slurm.sh: forward $HISTORY_STRATEGY through the worker docker exec -e block (was silently dropped, so the knob never reached the worker); fix the stale lr-override echo (gin default is 1e-7, not 0.001). - README: document MIN_HISTORY default as 4086. Co-authored-by: Cursor <cursoragent@cursor.com>
Stop tracking the local walkthrough doc (kept on disk, no longer in the repo). Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the example username in launch_slurm.sh / streaming_resume_test.sh comments with a generic <user> placeholder. Runtime defaults already derive the container name + mounts from $USER, so only doc/example strings changed. Co-authored-by: Cursor <cursoragent@cursor.com>
Document a complete sbatch launch (run-shape + data-fraction eval cadence) for 1-node and 2-node, noting the launchers differ only in --nodes. Co-authored-by: Cursor <cursoragent@cursor.com>
MetricsLogger now brackets eval and checkpoint phases with pause/resume perf timers so the reported step_ms reflects pure train-step latency. Adds wall_step_ms (inclusive), eval_ms, and ckpt_ms to the perf log line and TensorBoard scalars (appended for parser backward-compat). Checkpoint saves and eval windows are wrapped with categorized pause/resume calls. Co-authored-by: Cursor <cursoragent@cursor.com>
… consistency Co-authored-by: Cursor <cursoragent@cursor.com>
orchestrate() now defaults to the reference run-shape (START_TS=0, NUM_TRAIN_TS=299, full windows) + the data-fraction eval cadence (EVAL_EVERY_DATA_PCT=0.005, per-window off), so `sbatch scripts/launch_slurm.sh` needs no env knobs. SMOKE=1 restores the previous fast functional defaults (short window, capped batches, per-window eval). The two eval cadences are auto-deconflicted (explicit EVAL_EVERY_N_WINDOWS>0 disables data-pct). gin library defaults + the resume/local smoke paths are unchanged. README updated to the bare single/multi-node commands. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # recommendation_v4/generative_recommenders/dlrm_v3/train/utils.py
…s robustness - Dockerfiles (AMD+NVIDIA): install mlperf_logging (--no-deps) so compliance logging is not silently disabled at runtime. - Pin mlcommons/logging to 6.0.0-rc6 in requirements.txt + both Dockerfiles for reproducibility. - mlperf_logging_utils: guard empty os.path.dirname(log_path); attach file handler on rank 0 only; get_mlperf_logger() returns None when the dep is unavailable so callers' `is not None` guards disable logging cleanly. - launch_slurm.sh: chmod 622 (not 666) on the job log — tee -a needs write only, avoids world-readable logs on shared NFS. Co-authored-by: Cursor <cursoragent@cursor.com>
Add decorrelate_runtime_rng(rank): after make_model + DMP init, re-seed torch/cuda with $SEED + rank so HSTU dropout draws different masks per data-parallel rank instead of the identical masks implied by the shared init seed. Runs strictly after init so replicated dense weights and sharded embeddings stay init-identical across ranks. Toggle via $DECORRELATE_DROPOUT (default 1; 0 = legacy identical-mask behavior). Offset is a pure fn of resolved $SEED + rank and per-rank RNG state is checkpointed, so reproducibility/resume are preserved. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the B200 Dockerfile.nvidia and the local scripts/run_docker.sh helper — neither is wired into launch_slurm.sh (which uses rocm/primus directly) and both are environment-specific. Gitignore them plus the local ad-hoc analysis artifacts so they are not re-added. Co-authored-by: Cursor <cursoragent@cursor.com>
…l metric
- Collapse verbose comment blocks across gin/launch_slurm/resume-test to 1-2
lines; revert container name to yambda_primus and drop #SBATCH --time.
- Default DECORRELATE_DROPOUT=0 (identical dropout masks across ranks).
- Generalize EVAL_ACCURACY_AUC_MODE to any {window,lifetime}_{auc,gauc,
accuracy,ne} metric, with NE handled as lower-is-better for early-stop.
Co-authored-by: Cursor <cursoragent@cursor.com>
Collapse the large explanatory comment/docstring blocks in the MLPerf logging wiring (logger, train_ranker boundaries, streaming loop hooks, MetricsLogger counters) to keep the PR reviewable. No logic changes. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Revert recommendation_v4/.gitignore to base. Local run artifacts and ad-hoc analysis files are kept out of the repo via .git/info/exclude (local, uncommitted) instead, removing one file from the PR diff. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the MLPerf event stream into mlperf_logging_utils.py: a MLPerfRunTracker state machine owns the block/eval/run markers, progress metadata, and the convergence decision (replacing ~145 lines of closures in streaming_train_eval_loop), and MLPerfLogger.log_run_start emits submission info + hyperparameters + INIT_STOP/RUN_START (collapsing the inline block in train_ranker). Convergence/EVAL_ACCURACY is fixed to per-window AUC: drop the eval_accuracy_auc_mode knob (gin + loop param + launch_slurm passthrough). Submission identity: SUBMISSION_ORG defaults to AMD, SUBMISSION_PLATFORM to MI355X (was the org name — a bug), both overridable via $MLPERF_SUBMISSION_PLATFORM. Co-authored-by: Cursor <cursoragent@cursor.com>
The lifetime cumulative AUC always uses the exact binned backend now. Remove the TRAIN_LIFETIME_AUC_MODE / EVAL_LIFETIME_AUC_MODE env overrides (and the capped-only LIFETIME_AUC_WINDOW knob, now dead) from gin and launch_slurm. Co-authored-by: Cursor <cursoragent@cursor.com>
…rride" This reverts commit 900d4f1.
Reduce the upstream launch_slurm.sh diff to just what MLPerf logging needs: SCRATCH/REPO_MOUNT/DATA_MOUNT path portability (so outputs/log land off the hardcoded /home/chcai,/apps/chcai) and the MLPerf env wiring (MLPERF_LOG_PATH, AUC_THRESHOLD, MLPERF_LOGGING, MLPERF_SUBMISSION_PLATFORM, MLPERF_TRAIN_LOSS_LOG_FREQ). Reverted the unrelated baseline changes (NCCL GDR/IFNAME defaults, SMOKE/frozen run-shape, chmod/WORKER_TEE, HISTORY_STRATEGY, lifetime-AUC passthroughs) to Chris' base. Preserve the full kitchen-sink launcher as launch_slurm_suachong.sh for personal multi-node use (self-reinvoke paths repointed to itself). Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Persist mlperf_run_started in the checkpoint so a resume relaunch continues the SAME MLPerf run instead of re-emitting INIT_START/RUN_START (compliance requires EXACTLY_ONE). Cold-vs-resume is detected from the on-disk checkpoint before setup(); the log is truncated on a cold start and appended on a resume so the single run's event stream accumulates into one file. Legacy/cold checkpoints default the flag to False. Co-authored-by: Cursor <cursoragent@cursor.com>
Default SEED back to 1 for a fixed, reproducible weight init out of the box ($SEED=-1 still draws a fresh random seed per run). Default AUC_THRESHOLD to 1.0 (unreachable) in both the gin binding and the launch_slurm.sh fallback so a streaming-train-eval run trains through all windows by default instead of early-stopping; set $AUC_THRESHOLD=0.80275 for the MLPerf convergence target. Co-authored-by: Cursor <cursoragent@cursor.com>
Restore NCCL_NET_GDR_LEVEL=5 + NCCL_DMABUF_ENABLE=1 defaults so RCCL does true GPU<->NIC DMA over bnxt_re instead of host-memory staging (~+22% throughput at 2 nodes; 65.7%->79.8% weak-scaling efficiency). The brcmrdma host kernel ships the inbox peer-memory client, so GDR works with no container/host changes; non-fatal fallback to host staging if peermem is absent. Override with NCCL_NET_GDR_LEVEL=0. Co-authored-by: Cursor <cursoragent@cursor.com>
The slimmed launch_slurm.sh has smoke-shaped run defaults (START_TS=150, NUM_TRAIN_TS=1, NUM_TRAIN_BATCHES=20, per-window eval) and no SMOKE=1 toggle, so a bare submit is a short functional run — not the 299-window reference. Document the bare submit as the smoke run and give the explicit env-override command for the full reference sweep; drop the unimplemented SMOKE=1 instructions. Co-authored-by: Cursor <cursoragent@cursor.com>
dlrmv4: portable multi-node baseline + MLPerf compliance logging
Bring the HSTU/yambda-5b benchmark in line with MLPerf Training reference conventions: - add download_dataset.sh / verify_dataset.sh / run_and_time.sh wrappers - add md5sums checksum file (placeholder hashes) for dataset verification - restructure README.MD to the MLPerf spec (summaries, model+paper, hyperparameter table with tuning rules, quality target, eval frequency, steps-to-run) - freeze requirements.txt to the exact Dockerfile/training_recipe versions - add blank RCP placeholder (rcp/README.md) to be filled once convergence runs are generated Co-authored-by: Cursor <cursoragent@cursor.com>
|
MLCommons CLA bot: |
No description provided.