Closed
107 commits
ddcbdb4
Add recommendation_v4 (HSTU/DLRM-v3 generative-recommenders fork @ d9…
chriscai-amd May 29, 2026
5d80fa5
Enable Triton HSTU kernels on AMD/ROCm (gfx950 MI350X)
chriscai-amd May 29, 2026
2f516a9
Fix AttributeError on triton.knobs.nvidia.use_meta_ws
chriscai-amd May 29, 2026
1ca5efe
Make HSTU model arch + dataset history_length gin-tunable
chriscai-amd May 30, 2026
5a0173e
Make EmbeddingShardingPlanner hbm_cap_gb gin-tunable, set ddr_cap=0
chriscai-amd May 30, 2026
d0f176a
Add yambda 50m/500m/5b preprocessor + DLRM_DATA_PATH env override
chriscai-amd May 30, 2026
b25b4cf
Fix NCCL init: set CUDA device before init_process_group
chriscai-amd May 30, 2026
482cb6d
Profiler: write traces to local disk under <repo>/results/<run_name>/
chriscai-amd May 31, 2026
b4ea8ca
Tracing fixes: gin scoping, drop active=10 override, intuitive filena…
chriscai-amd May 31, 2026
0ca317f
gin: history_length 2048 → 2039 + expanded per-pool comment
chriscai-amd Jun 1, 2026
d505bff
README: rewrite for yambda-5b fork — upstream link, data prep, per-po…
chriscai-amd Jun 1, 2026
1229f59
gin: make hbm_cap_gb overridable via $HBM_CAP_GB
chriscai-amd Jun 1, 2026
4de66a4
docs: add B200 training recipe for yambda-5b
chriscai-amd Jun 1, 2026
a5df1e3
bf16 + triton autotune pinning with gin-driven full-tune override
chriscai-amd Jun 1, 2026
72d1c2b
docs: add MI350X training recipe section
chriscai-amd Jun 1, 2026
7c278a0
scripts: add stitch_traces.py
chriscai-amd Jun 1, 2026
ac58fd7
docs: update B200 recipe deps to NGC 26.04 (torch 2.12 / CUDA 13.2)
chriscai-amd Jun 1, 2026
4c08c6c
docs: refresh B200 recipe deps (fbgemm HEAD, torchrec 1.7 nightly, dr…
chriscai-amd Jun 2, 2026
fa5ee7c
MI350X: re-pin 2 triton configs for the torch 2.12 + torchrec 1.7 stack
chriscai-amd Jun 2, 2026
674df28
docs: update MI350X Stack B to fbgemm @ B200 commit + caveat
chriscai-amd Jun 2, 2026
b3af839
docs: drop Stack A; MI350X recipe is now single-stack (B200-aligned)
chriscai-amd Jun 2, 2026
04d0142
docs: drop PYTORCH-fallback caveat from MI350X recipe
chriscai-amd Jun 2, 2026
cde77e9
MI350X: fit-entity embedding sizes, bs=1024 default, batch-agnostic r…
chriscai-amd Jun 2, 2026
704d955
MI350X: separated-RNG LN-dropout + attention autotune pin + clock guard
chriscai-amd Jun 2, 2026
3936538
dlrmv4: TorchRec 3-stage sparse-dist pipeline + gin-selectable HSTU k…
chriscai-amd Jun 2, 2026
a212c8a
dlrmv4: streaming (temporal-order) training for yambda-5b
chriscai-amd Jun 3, 2026
527ed0c
dlrmv4: disable checkpointing by default; fix recipe torch note
chriscai-amd Jun 3, 2026
d4c70ba
dlrmv4: ROCm-only Perfetto trace render fixes at export time
chriscai-amd Jun 3, 2026
99de092
dlrmv4: ROCm annotation de-overlap so phase spans render full width
chriscai-amd Jun 4, 2026
0e71b4c
dlrmv4: streaming checkpoint resume + step/time checkpoint cadences
chriscai-amd Jun 4, 2026
4f44e96
dlrmv4: move streaming resume test harness into train/tests
chriscai-amd Jun 4, 2026
925d285
dlrmv4: sparse full-holdout eval cadence + eval-pool fork-race fix
chriscai-amd Jun 4, 2026
41d034d
dlrmv4: self-healing streaming-e2e supervisor + NE/AUC trajectory bui…
chriscai-amd Jun 4, 2026
d425c4f
dlrmv4: durable streaming-run metrics (append log + TensorBoard on NFS)
chriscai-amd Jun 4, 2026
4211afb
dlrmv4: raise streaming-run disk guard for keep_last_n=1 saves
chriscai-amd Jun 4, 2026
d9274db
dlrmv4: anchor eval points to train global step in NE/AUC trajectory
chriscai-amd Jun 4, 2026
80981bb
dlrmv4: supervisor tolerates control-plane outages + attach mode
chriscai-amd Jun 4, 2026
2d57b09
dlrmv4: TFLOPS/MFU/HFU reporting + MAX_SEQ_LEN/HISTORY_LENGTH gin knobs
chriscai-amd Jun 5, 2026
b84d322
dlrmv4: configurable lifetime-AUC backend + fixed-holdout streaming eval
chriscai-amd Jun 9, 2026
c8e965c
dlrmv4: default yambda-5b to the 4k-no-truncation seq shape
chriscai-amd Jun 9, 2026
cf9f6c2
dlrmv4: reservation-aware sbatch failover in streaming supervisor
chriscai-amd Jun 9, 2026
ca1e27d
dlrmv4: two-tier reservation-then-open-pool failover in streaming sup…
chriscai-amd Jun 9, 2026
58d6c65
dlrmv4: cap failover at <=1 reservation node + fix trainer-alive self…
chriscai-amd Jun 9, 2026
1d0dd71
dlrmv4: anchor sparse-eval cadence to absolute window ts (resume-inva…
chriscai-amd Jun 9, 2026
d7cb873
dlrmv4: multi-node (N>=1) training over RoCE RDMA via consolidated la…
chriscai-amd Jun 10, 2026
7a7c1fb
local dlrmv4 changes: docker setup, run scripts, walkthrough docs, sm…
suachong Jun 10, 2026
9e9b9eb
dlrmv4: decouple multi-node launch from per-user paths for portable b…
Jun 10, 2026
8ea46ec
dlrmv4: deterministic in-window shuffle diversity dial + opt-in diagn…
chriscai-amd Jun 10, 2026
4f2ff3e
dlrmv4: MLPerf training compliance logging for streaming-train-eval
Jun 11, 2026
fa9b610
dlrmv4: min_history anchor-eligibility floor (decoupled from history_…
chriscai-amd Jun 13, 2026
994b866
dlrmv4: gin defaults — min_history=1, BATCH_SIZE env knob, default to…
chriscai-amd Jun 13, 2026
977a41f
dlrmv4: launch_slurm container hygiene + readiness gating + env passt…
chriscai-amd Jun 13, 2026
d5443fc
dlrmv4: durable per-boundary eval metrics JSONL sink + aggregate emb-…
chriscai-amd Jun 13, 2026
cf9dea4
dlrmv4: default MIN_HISTORY=0 (include cold-start first events) to ma…
chriscai-amd Jun 13, 2026
bdc69b8
dlrmv4: default NUM_TRAIN_TS=149 to sweep full ts=150..298 streaming …
chriscai-amd Jun 13, 2026
00b474e
dlrmv4: consolidate eval cadence into single EVAL_EVERY_N_WINDOWS knob
chriscai-amd Jun 13, 2026
3587a55
dlrmv4: random per-run seed + graceful teardown + no double-logging
Jun 15, 2026
24abd3d
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
Jun 15, 2026
d5b6231
dlrmv4: fix single-node-without-slurm via run_docker.sh
suachong Jun 15, 2026
aae0e87
dlrmv4: gin-configurable quantized (bf16/fp16) embedding all-to-all
chriscai-amd Jun 16, 2026
c7c6afb
dlrmv4: split quantized a2a into independent fwd/bwd precision knobs
chriscai-amd Jun 16, 2026
b8f0e50
dlrmv4: enable GPUDirect RDMA by default in slurm worker
Jun 16, 2026
5380d37
dlrmv4: gin-configurable RNG seed ($SEED) + default TRAIN_SPLIT_PERCE…
chriscai-amd Jun 16, 2026
49600d5
dlrmv4: env-configurable dense/sparse LR + optimizer LR logging
chriscai-amd Jun 17, 2026
2b85d1c
dlrmv4: env-configurable HSTU transformer depth ($HSTU_NUM_LAYERS)
chriscai-amd Jun 17, 2026
d98013d
dlrmv4: env-overridable dense/sparse LRs for sweeps + holdout default…
Jun 18, 2026
835ce31
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
Jun 18, 2026
ad19525
dlrmv4: disable TensorBoard by default (no-op writer)
chriscai-amd Jun 18, 2026
cd630de
dlrmv4: default dense/sparse LR=1e-5 and HSTU depth=3
chriscai-amd Jun 18, 2026
398c61f
dlrmv4: report EVAL_ACCURACY as per-window AUC (configurable, default…
Jun 18, 2026
64c251f
dlrmv4: consolidate streaming e2e supervisor to one sbatch-wrapping s…
chriscai-amd Jun 18, 2026
1b512e8
dlrmv4: add non-SLURM local launcher + self-healing supervisor
chriscai-amd Jun 19, 2026
d072943
dlrmv4: set default-PG timeout to TIMEOUT (1800s) to survive checkpoi…
chriscai-amd Jun 19, 2026
405e98d
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
Jun 22, 2026
3d8aca1
dlrmv4: optional gradient clipping for the streaming path ($GRAD_CLIP…
chriscai-amd Jun 22, 2026
a7ddc7a
dlrmv4: data-fraction eval cadence + lr1e-7/grad-clip-on defaults
chriscai-amd Jun 22, 2026
5db7c69
dlrmv4: seed embedding init + reproducibility checksum ($SEED)
nehaprakriya Jun 22, 2026
cd8f95f
dlrmv4: default INIT_CHECKSUM off (fp64 shard copy OOMs the build)
chriscai-amd Jun 22, 2026
cd93179
dlrmv4: add last_n UIH history strategy ($HISTORY_STRATEGY)
nehaprakriya Jun 23, 2026
3795773
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
suachong Jun 23, 2026
35b537d
dlrmv4: logging-freeze prep — MIN_HISTORY=4086 default + HISTORY_STRA…
suachong Jun 24, 2026
e4371b2
dlrmv4: untrack docs/v4_vs_v2_and_hstu_walkthrough.md
suachong Jun 24, 2026
284bc06
dlrmv4: scrub hardcoded username from reference comments
suachong Jun 24, 2026
12740df
dlrmv4: README — add full single/multi-node reference run example
suachong Jun 24, 2026
bd43cfe
dlrmv4: exclude eval/checkpoint overhead from step_ms timing
nehaprakriya Jun 24, 2026
776f263
dlrmv4: README — use AUC_THRESHOLD=0.80275 in example for gin-default…
suachong Jun 24, 2026
2514a01
dlrmv4: make a bare sbatch reproduce the frozen reference run
suachong Jun 24, 2026
2174392
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
suachong Jun 24, 2026
4df112e
dlrmv4: address PR review — mlperf_logging install/pin + logging-util…
suachong Jun 24, 2026
494f606
dlrmv4: decorrelate per-rank runtime RNG for HSTU dropout
suachong Jun 24, 2026
c563c02
dlrmv4: drop non-portable build/run helpers from the baseline
suachong Jun 24, 2026
ed62e7d
dlrmv4: trim PR comments, default DECORRELATE_DROPOUT off, extend eva…
suachong Jun 24, 2026
c4d8700
dlrmv4: trim verbose MLPerf-wiring comments to 1-2 lines
suachong Jun 24, 2026
af596c8
dlrmv4: trim seed_everything / decorrelate_runtime_rng docstrings
suachong Jun 24, 2026
6b2e587
dlrmv4: drop .gitignore changes from the PR
suachong Jun 24, 2026
13cb2b0
dlrmv4: centralize MLPerf emission + fix submission identity
suachong Jun 24, 2026
900d4f1
dlrmv4: hardcode lifetime-AUC backend to binned, drop the override
suachong Jun 24, 2026
a584795
Revert "dlrmv4: hardcode lifetime-AUC backend to binned, drop the ove…
suachong Jun 24, 2026
14f129d
dlrmv4: slim launch_slurm.sh to MLPerf wiring + path portability
suachong Jun 24, 2026
97b7057
dlrmv4: drop launch_slurm_suachong.sh from the PR (keep local only)
suachong Jun 24, 2026
488cacb
dlrmv4: revert streaming_resume_test.sh to base (out of MLPerf PR scope)
suachong Jun 24, 2026
1730b89
dlrmv4: make MLPerf run markers resume-aware via checkpoint state
nehaprakriya Jun 25, 2026
baa701f
dlrmv4: reproducible-by-default config (seed=1, AUC_THRESHOLD=1.0)
nehaprakriya Jun 25, 2026
afbec22
dlrmv4: re-enable GPUDirect RDMA by default in slurm worker
nehaprakriya Jun 25, 2026
937f28e
dlrmv4: README — match launcher's actual smoke-shaped defaults
nehaprakriya Jun 25, 2026
1632206
Merge pull request #1 from chriscai-amd/suachong/dlrmv4
chriscai-amd Jun 25, 2026
ef5ccdc
recommendation_v4: add MLPerf reference scripts, structure, and docs
nehaprakriya Jun 25, 2026
3 changes: 3 additions & 0 deletions .gitmodules
@@ -6,3 +6,6 @@
	path = text_to_image/torchtitan
	url = https://github.com/pytorch/torchtitan.git
	branch = mlperf-training-flux.1
[submodule "recommendation_v4/cutlass"]
	path = recommendation_v4/generative_recommenders/ops/cpp/cutlass
	url = https://github.com/NVIDIA/cutlass.git
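
In the new entry the submodule name (`recommendation_v4/cutlass`) differs from its checkout path; git uses the name as the identifier and `path` as the location in the working tree. The following is a minimal sketch, assuming Python's `configparser` is close enough to git's config syntax for a simple file like this one (it is not a full git-config parser). The `submodules` helper is hypothetical and only illustrates how the three fields relate.

```python
import configparser

# Text of the added entry. Tab indentation is assumed here because the
# diff view does not preserve leading whitespace.
GITMODULES = (
    '[submodule "recommendation_v4/cutlass"]\n'
    "\tpath = recommendation_v4/generative_recommenders/ops/cpp/cutlass\n"
    "\turl = https://github.com/NVIDIA/cutlass.git\n"
)


def submodules(text):
    """Map each submodule name to its key/value settings (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    result = {}
    for section in parser.sections():
        # Section headers have the form: submodule "<name>"
        name = section.split('"')[1]
        result[name] = dict(parser[section])
    return result


entries = submodules(GITMODULES)
for name, settings in entries.items():
    print(name, "->", settings["path"])
```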
159 changes: 159 additions & 0 deletions recommendation_v4/.gitignore
@@ -0,0 +1,159 @@
# Don't check in parsed data files and other temporary files
tmp/
exps/
ckpts/
results/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
86 changes: 86 additions & 0 deletions recommendation_v4/Dockerfile
@@ -0,0 +1,86 @@
# MI350X path — implements docs/training_recipe.md §"MI350X".

FROM rocm/primus:v26.3

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

WORKDIR /workspace/recommendation_v4

# torch / torchvision / torchaudio — training_recipe.md:38-40.
RUN pip install --upgrade --no-deps \
    --index-url https://download.pytorch.org/whl/rocm7.2 \
    torch==2.12.0+rocm7.2 \
    torchvision==0.27.0+rocm7.2 \
    torchaudio==2.11.0+rocm7.2

# torchrec — training_recipe.md:43.
RUN pip install --force-reinstall --no-deps \
    "git+https://github.com/pytorch/torchrec.git@v2026.06.01.00"

# fbgemm_gpu — training_recipe.md:42. Build from FBGEMM commit 10b77573 for
# gfx950 against the replaced torch. ~30-60 min.
RUN apt-get update && apt-get install -y --no-install-recommends git build-essential && \
    rm -rf /var/lib/apt/lists/* && \
    git clone --recursive https://github.com/pytorch/FBGEMM.git /tmp/FBGEMM && \
    cd /tmp/FBGEMM && \
    git checkout 10b775730212923f65f7b78f79b6a01d80cf3c29 && \
    git submodule update --init --recursive && \
    cd fbgemm_gpu && \
    # Filter `fairscale` and the torch family from fbgemm's requirements.txt:
    # fairscale pulls a CPU torch that would clobber the +rocm7.2 wheel installed
    # above. fairscale is a distributed-training lib used by fbgemm tests, not
    # by the build itself.
    grep -v -E '^(fairscale|torch|torchvision|torchaudio)([<>=!]|$)' requirements.txt > /tmp/req.txt && \
    pip install -r /tmp/req.txt && \
    python setup.py -j 32 bdist_wheel \
        --build-target=default \
        --build-variant=rocm \
        -DHIP_ROOT_DIR=/opt/rocm \
        -DAMDGPU_TARGETS=gfx950 && \
    pip install --force-reinstall --no-deps dist/fbgemm_gpu_nightly_rocm*.whl && \
    cd / && rm -rf /tmp/FBGEMM

# polars-u64-idx — training_recipe.md:44 (mandatory: yambda-5b has more rows
# than the 2^32 ≈ 4.29 B limit of the default 32-bit-index polars build).
# Remaining packages — training_recipe.md:156-159 ("Additional Python deps") plus
# `datasets` + `huggingface_hub`, which the recipe does not list but
# preprocess_public_data.py:278 imports to download yambda from HuggingFace.
RUN pip install \
        polars-u64-idx==1.33.1 \
        gin-config \
        absl-py \
        datasets \
        huggingface_hub \
        pyre-extensions \
        iopath \
        typing-inspect \
        psutil \
        tqdm \
        pyyaml \
        lightning-utilities && \
    # torchmetrics and tensordict declare `torch` as a dep; without --no-deps
    # pip pulls torch==2.12.0+cu130 from PyPI which clobbers the +rocm7.2 wheel
    # we installed above (libtorch_hip.so disappears, fbgemm_gpu fails to load).
    pip install --no-deps \
        torchmetrics==1.0.3 \
        tensordict

# mlperf_logging — required by train/mlperf_logging_utils.py for MLPerf
# compliance logs. Pinned to the 6.0.0-rc6 tag (Training 6.0) for
# reproducibility; --no-deps so pip does not resolve requirements.txt's
# torch/fbgemm_gpu/torchrec pins and clobber the +rocm7.2 wheels above.
RUN pip install --no-deps "git+https://github.com/mlcommons/logging.git@6.0.0-rc6"

# Smoke-test the 6 imports the launch script checks at
# scripts/launch_smoke_8gpu.sh:26.
RUN python -c "import torch, fbgemm_gpu, torchrec, polars, xxhash, gin; \
    print('torch', torch.__version__, '| hip', getattr(torch.version, 'hip', None))"

COPY . /workspace/recommendation_v4

ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    HSTU_HAMMER_KERNEL=TRITON \
    DLRM_DATA_PATH=/data/mlperf_dlrm_v4

CMD ["bash"]
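
The `grep -v -E` filter in the fbgemm_gpu build step decides which requirement lines survive into `/tmp/req.txt`. The sketch below replays the same pattern with Python's `re` module to show what it drops and what it keeps. The sample lines are hypothetical and are not copied from FBGEMM's `requirements.txt`; `filter_requirements` is an illustrative helper, not part of the PR.

```python
import re

# Same pattern as the Dockerfile's `grep -v -E` filter.
DROP = re.compile(r"^(fairscale|torch|torchvision|torchaudio)([<>=!]|$)")


def filter_requirements(lines):
    """Keep every line the pattern does not match, like `grep -v`."""
    return [line for line in lines if not DROP.search(line)]


# Hypothetical requirement lines chosen to exercise the pattern.
sample = [
    "fairscale",            # bare name: dropped via the `$` branch
    "torch>=2.0",           # name followed by a comparison operator: dropped
    "torchvision==0.27.0",  # dropped
    "torchmetrics",         # shares the `torch` prefix but is kept
    "numpy",                # unrelated: kept
    "torch ~= 2.0",         # a space follows the name, so the pattern misses it
]

kept = filter_requirements(sample)
print(kept)
```

The last sample line shows a limit of the pattern: a specifier written with a space after the package name, or with `~=` or an extras bracket directly after it, would pass through the filter. Whether that matters depends on how the pinned FBGEMM commit writes its requirements, which is worth checking before relying on the filter.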