121 commits
461b0f4
Add recommendation_v4 (HSTU/DLRM-v3 generative-recommenders fork @ d9…
chriscai-amd May 29, 2026
9b56d4f
Enable Triton HSTU kernels on AMD/ROCm (gfx950 MI350X)
chriscai-amd May 29, 2026
751f2f0
Fix AttributeError on triton.knobs.nvidia.use_meta_ws
chriscai-amd May 29, 2026
0f1fcf0
Make HSTU model arch + dataset history_length gin-tunable
chriscai-amd May 30, 2026
b7f8b2d
Make EmbeddingShardingPlanner hbm_cap_gb gin-tunable, set ddr_cap=0
chriscai-amd May 30, 2026
5ba9b07
Add yambda 50m/500m/5b preprocessor + DLRM_DATA_PATH env override
chriscai-amd May 30, 2026
c639f6a
Fix NCCL init: set CUDA device before init_process_group
chriscai-amd May 30, 2026
d76f96b
Profiler: write traces to local disk under <repo>/results/<run_name>/
chriscai-amd May 31, 2026
4a2a7bd
Tracing fixes: gin scoping, drop active=10 override, intuitive filena…
chriscai-amd May 31, 2026
9b0ae8b
gin: history_length 2048 → 2039 + expanded per-pool comment
chriscai-amd Jun 1, 2026
2b58a63
README: rewrite for yambda-5b fork — upstream link, data prep, per-po…
chriscai-amd Jun 1, 2026
03662de
gin: make hbm_cap_gb overridable via $HBM_CAP_GB
chriscai-amd Jun 1, 2026
bb012a2
docs: add B200 training recipe for yambda-5b
chriscai-amd Jun 1, 2026
8d8844b
bf16 + triton autotune pinning with gin-driven full-tune override
chriscai-amd Jun 1, 2026
1c9315a
docs: add MI350X training recipe section
chriscai-amd Jun 1, 2026
3027519
scripts: add stitch_traces.py
chriscai-amd Jun 1, 2026
8cbbab0
docs: update B200 recipe deps to NGC 26.04 (torch 2.12 / CUDA 13.2)
chriscai-amd Jun 1, 2026
fc4d092
docs: refresh B200 recipe deps (fbgemm HEAD, torchrec 1.7 nightly, dr…
chriscai-amd Jun 2, 2026
194fc9b
MI350X: re-pin 2 triton configs for the torch 2.12 + torchrec 1.7 stack
chriscai-amd Jun 2, 2026
671e7e6
docs: update MI350X Stack B to fbgemm @ B200 commit + caveat
chriscai-amd Jun 2, 2026
b3d1764
docs: drop Stack A; MI350X recipe is now single-stack (B200-aligned)
chriscai-amd Jun 2, 2026
7f6553e
docs: drop PYTORCH-fallback caveat from MI350X recipe
chriscai-amd Jun 2, 2026
cb68c8e
MI350X: fit-entity embedding sizes, bs=1024 default, batch-agnostic r…
chriscai-amd Jun 2, 2026
2e95a4d
MI350X: separated-RNG LN-dropout + attention autotune pin + clock guard
chriscai-amd Jun 2, 2026
17e04af
dlrmv4: TorchRec 3-stage sparse-dist pipeline + gin-selectable HSTU k…
chriscai-amd Jun 2, 2026
123c55c
dlrmv4: streaming (temporal-order) training for yambda-5b
chriscai-amd Jun 3, 2026
d017bcc
dlrmv4: disable checkpointing by default; fix recipe torch note
chriscai-amd Jun 3, 2026
7d11c17
dlrmv4: ROCm-only Perfetto trace render fixes at export time
chriscai-amd Jun 3, 2026
e024217
dlrmv4: ROCm annotation de-overlap so phase spans render full width
chriscai-amd Jun 4, 2026
f447075
dlrmv4: streaming checkpoint resume + step/time checkpoint cadences
chriscai-amd Jun 4, 2026
4a18b95
dlrmv4: move streaming resume test harness into train/tests
chriscai-amd Jun 4, 2026
f89da0e
dlrmv4: sparse full-holdout eval cadence + eval-pool fork-race fix
chriscai-amd Jun 4, 2026
039f7c9
dlrmv4: self-healing streaming-e2e supervisor + NE/AUC trajectory bui…
chriscai-amd Jun 4, 2026
0be8a71
dlrmv4: durable streaming-run metrics (append log + TensorBoard on NFS)
chriscai-amd Jun 4, 2026
da8a9eb
dlrmv4: raise streaming-run disk guard for keep_last_n=1 saves
chriscai-amd Jun 4, 2026
3c896c9
dlrmv4: anchor eval points to train global step in NE/AUC trajectory
chriscai-amd Jun 4, 2026
04dc53a
dlrmv4: supervisor tolerates control-plane outages + attach mode
chriscai-amd Jun 4, 2026
45e0daf
dlrmv4: TFLOPS/MFU/HFU reporting + MAX_SEQ_LEN/HISTORY_LENGTH gin knobs
chriscai-amd Jun 5, 2026
e794c0a
dlrmv4: configurable lifetime-AUC backend + fixed-holdout streaming eval
chriscai-amd Jun 9, 2026
efc5126
dlrmv4: default yambda-5b to the 4k-no-truncation seq shape
chriscai-amd Jun 9, 2026
03362da
dlrmv4: reservation-aware sbatch failover in streaming supervisor
chriscai-amd Jun 9, 2026
888f701
dlrmv4: two-tier reservation-then-open-pool failover in streaming sup…
chriscai-amd Jun 9, 2026
7c36891
dlrmv4: cap failover at <=1 reservation node + fix trainer-alive self…
chriscai-amd Jun 9, 2026
7c9188f
dlrmv4: anchor sparse-eval cadence to absolute window ts (resume-inva…
chriscai-amd Jun 9, 2026
dbb9e1d
dlrmv4: multi-node (N>=1) training over RoCE RDMA via consolidated la…
chriscai-amd Jun 10, 2026
15c4a00
local dlrmv4 changes: docker setup, run scripts, walkthrough docs, sm…
suachong Jun 10, 2026
d03fb28
dlrmv4: decouple multi-node launch from per-user paths for portable b…
suachong Jun 10, 2026
c55736e
dlrmv4: deterministic in-window shuffle diversity dial + opt-in diagn…
chriscai-amd Jun 10, 2026
851a354
dlrmv4: MLPerf training compliance logging for streaming-train-eval
suachong Jun 11, 2026
2fc3cb1
dlrmv4: min_history anchor-eligibility floor (decoupled from history_…
chriscai-amd Jun 13, 2026
e5214b8
dlrmv4: gin defaults — min_history=1, BATCH_SIZE env knob, default to…
chriscai-amd Jun 13, 2026
5fdf70c
dlrmv4: launch_slurm container hygiene + readiness gating + env passt…
chriscai-amd Jun 13, 2026
6182ab4
dlrmv4: durable per-boundary eval metrics JSONL sink + aggregate emb-…
chriscai-amd Jun 13, 2026
51e6e5f
dlrmv4: default MIN_HISTORY=0 (include cold-start first events) to ma…
chriscai-amd Jun 13, 2026
03a5be1
dlrmv4: default NUM_TRAIN_TS=149 to sweep full ts=150..298 streaming …
chriscai-amd Jun 13, 2026
20f6917
dlrmv4: consolidate eval cadence into single EVAL_EVERY_N_WINDOWS knob
chriscai-amd Jun 13, 2026
9171e4f
dlrmv4: random per-run seed + graceful teardown + no double-logging
suachong Jun 15, 2026
29ebfe2
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
suachong Jun 15, 2026
ded9c30
dlrmv4: fix single-node-without-slurm via run_docker.sh
suachong Jun 15, 2026
380b1ca
dlrmv4: gin-configurable quantized (bf16/fp16) embedding all-to-all
chriscai-amd Jun 16, 2026
3e776a8
dlrmv4: split quantized a2a into independent fwd/bwd precision knobs
chriscai-amd Jun 16, 2026
5f76993
dlrmv4: enable GPUDirect RDMA by default in slurm worker
suachong Jun 16, 2026
a34facc
dlrmv4: gin-configurable RNG seed ($SEED) + default TRAIN_SPLIT_PERCE…
chriscai-amd Jun 16, 2026
40c696a
dlrmv4: env-configurable dense/sparse LR + optimizer LR logging
chriscai-amd Jun 17, 2026
d646b2a
dlrmv4: env-configurable HSTU transformer depth ($HSTU_NUM_LAYERS)
chriscai-amd Jun 17, 2026
922aeec
dlrmv4: env-overridable dense/sparse LRs for sweeps + holdout default…
suachong Jun 18, 2026
8a9282d
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
suachong Jun 18, 2026
e0d6e46
dlrmv4: disable TensorBoard by default (no-op writer)
chriscai-amd Jun 18, 2026
bc743b5
dlrmv4: default dense/sparse LR=1e-5 and HSTU depth=3
chriscai-amd Jun 18, 2026
b4ce3ba
dlrmv4: report EVAL_ACCURACY as per-window AUC (configurable, default…
suachong Jun 18, 2026
c5469a4
dlrmv4: consolidate streaming e2e supervisor to one sbatch-wrapping s…
chriscai-amd Jun 18, 2026
be892d8
dlrmv4: add non-SLURM local launcher + self-healing supervisor
chriscai-amd Jun 19, 2026
7f73ac7
dlrmv4: set default-PG timeout to TIMEOUT (1800s) to survive checkpoi…
chriscai-amd Jun 19, 2026
66b18eb
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
suachong Jun 22, 2026
09b4204
dlrmv4: optional gradient clipping for the streaming path ($GRAD_CLIP…
chriscai-amd Jun 22, 2026
3cb9eb6
dlrmv4: data-fraction eval cadence + lr1e-7/grad-clip-on defaults
chriscai-amd Jun 22, 2026
c2342c5
dlrmv4: seed embedding init + reproducibility checksum ($SEED)
chriscai-amd Jun 22, 2026
eef3304
dlrmv4: default INIT_CHECKSUM off (fp64 shard copy OOMs the build)
chriscai-amd Jun 22, 2026
38a9175
dlrmv4: add last_n UIH history strategy ($HISTORY_STRATEGY)
chriscai-amd Jun 23, 2026
fde24f3
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
suachong Jun 23, 2026
f14485a
dlrmv4: logging-freeze prep — MIN_HISTORY=4086 default + HISTORY_STRA…
suachong Jun 24, 2026
53f5176
dlrmv4: untrack docs/v4_vs_v2_and_hstu_walkthrough.md
suachong Jun 24, 2026
0f7a6eb
dlrmv4: scrub hardcoded username from reference comments
suachong Jun 24, 2026
31a38fa
dlrmv4: README — add full single/multi-node reference run example
suachong Jun 24, 2026
3cd9f6d
dlrmv4: exclude eval/checkpoint overhead from step_ms timing
chriscai-amd Jun 24, 2026
dc412ff
dlrmv4: README — use AUC_THRESHOLD=0.80275 in example for gin-default…
suachong Jun 24, 2026
fad177f
dlrmv4: make a bare sbatch reproduce the frozen reference run
suachong Jun 24, 2026
e078dd4
Merge remote-tracking branch 'origin/chcai/dlrmv4' into suachong/dlrmv4
suachong Jun 24, 2026
b50fe58
dlrmv4: address PR review — mlperf_logging install/pin + logging-util…
suachong Jun 24, 2026
9d1dbf8
dlrmv4: decorrelate per-rank runtime RNG for HSTU dropout
suachong Jun 24, 2026
784516a
dlrmv4: drop non-portable build/run helpers from the baseline
suachong Jun 24, 2026
3f36f61
dlrmv4: trim PR comments, default DECORRELATE_DROPOUT off, extend eva…
suachong Jun 24, 2026
a9f8efd
dlrmv4: trim verbose MLPerf-wiring comments to 1-2 lines
suachong Jun 24, 2026
024e64b
dlrmv4: trim seed_everything / decorrelate_runtime_rng docstrings
suachong Jun 24, 2026
f63e5d4
dlrmv4: drop .gitignore changes from the PR
suachong Jun 24, 2026
02f2d2b
dlrmv4: centralize MLPerf emission + fix submission identity
suachong Jun 24, 2026
ae99293
dlrmv4: hardcode lifetime-AUC backend to binned, drop the override
suachong Jun 24, 2026
da74883
Revert "dlrmv4: hardcode lifetime-AUC backend to binned, drop the ove…
suachong Jun 24, 2026
1ada43f
dlrmv4: slim launch_slurm.sh to MLPerf wiring + path portability
suachong Jun 24, 2026
1d25713
dlrmv4: drop launch_slurm_suachong.sh from the PR (keep local only)
suachong Jun 24, 2026
1dae61d
dlrmv4: revert streaming_resume_test.sh to base (out of MLPerf PR scope)
suachong Jun 24, 2026
7ec6fcc
dlrmv4: make MLPerf run markers resume-aware via checkpoint state
chriscai-amd Jun 25, 2026
5c2b940
dlrmv4: reproducible-by-default config (seed=1, AUC_THRESHOLD=1.0)
chriscai-amd Jun 25, 2026
ff55513
dlrmv4: re-enable GPUDirect RDMA by default in slurm worker
chriscai-amd Jun 25, 2026
7b5e869
dlrmv4: README — match launcher's actual smoke-shaped defaults
chriscai-amd Jun 25, 2026
392241f
Merge pull request #1 from chriscai-amd/suachong/dlrmv4
chriscai-amd Jun 25, 2026
881d925
recommendation_v4: add MLPerf reference scripts, structure, and docs
chriscai-amd Jun 25, 2026
4ca6c08
Merge pull request #2 from chriscai-amd/chcai/mlperf_refactor
chriscai-amd Jun 25, 2026
4cf4d85
recommendation_v4: prune inference/AOT/CUDA-cpp/research + non-yambda…
chriscai-amd Jun 25, 2026
b68bfb7
Make data-fraction eval cadence the default
chriscai-amd Jun 26, 2026
7e8de35
dlrmv3 streaming: fix distributed sync + generalize checkpoint/resume…
chriscai-amd Jun 26, 2026
f8807ca
dlrmv3 streaming: make resume e2e test pass on MI350/NFS
chriscai-amd Jun 26, 2026
ea37064
dlrmv3 streaming: document midwindow vs multiwindow resume test purpose
chriscai-amd Jun 26, 2026
e46358f
Merge pull request #3 from chriscai-amd/chcai/nv_fix
chriscai-amd Jun 26, 2026
8db66f3
dlrmv3: gin/env-configurable embedding table placement (HBM/UVM)
chriscai-amd Jun 27, 2026
d0ce11e
dlrmv3 qcomm: hard-fail instead of silently falling back to fp32 a2a
chriscai-amd Jun 29, 2026
584cb66
dlrmv3 yambda-5b: default embedding a2a quantization to fp16/fp16
chriscai-amd Jun 29, 2026
3a69560
dlrmv3 yambda-5b: default gin to the canonical fp16 full-corpus run
chriscai-amd Jun 29, 2026
4f987c1
dlrmv3 qcomm: low-memory fbgemm fp16/bf16 a2a codec to fix skewed-bat…
chriscai-amd Jun 30, 2026
a856229
dlrmv3 launch_slurm: forward QCOMM_LOWMEM_CODEC env into the container
chriscai-amd Jun 30, 2026
b8f8298
dlrmv3 sharding: gin/env-configurable per-table embedding sharding-ty…
chriscai-amd Jul 1, 2026
3 changes: 3 additions & 0 deletions .gitmodules
@@ -6,3 +6,6 @@
	path = text_to_image/torchtitan
	url = https://github.com/pytorch/torchtitan.git
	branch = mlperf-training-flux.1
[submodule "recommendation_v4/cutlass"]
	path = recommendation_v4/generative_recommenders/ops/cpp/cutlass
	url = https://github.com/NVIDIA/cutlass.git
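For reviewers who want to check the new submodule entry programmatically, here is a minimal sketch that reads `.gitmodules`-style text with Python's standard `configparser`. The helper name `read_submodules` and the inline sample string are illustrative assumptions, not part of this PR. The sample copies only the entry added in this hunk.

```python
import configparser

# Inline copy of the entry added in this hunk (sample text for illustration only).
GITMODULES = """\
[submodule "recommendation_v4/cutlass"]
    path = recommendation_v4/generative_recommenders/ops/cpp/cutlass
    url = https://github.com/NVIDIA/cutlass.git
"""


def read_submodules(text):
    """Return {submodule name: {key: value}} for .gitmodules-style text.

    Hypothetical helper: relies on configparser accepting consistently
    indented `key = value` lines under each section header.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    submodules = {}
    for section in parser.sections():
        # Section names look like: submodule "recommendation_v4/cutlass"
        name = section.split('"')[1]
        submodules[name] = dict(parser[section])
    return submodules


if __name__ == "__main__":
    for name, fields in read_submodules(GITMODULES).items():
        print(name, "->", fields["path"], fields["url"])
```

In a checkout, `git config -f .gitmodules --list` gives the same information without any Python.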
159 changes: 159 additions & 0 deletions recommendation_v4/.gitignore
@@ -0,0 +1,159 @@
# Don't check in parsed data files and other temporary files
tmp/
exps/
ckpts/
results/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
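As a rough illustration of how the glob-style entries above behave, the sketch below checks a few file basenames against three of the patterns using Python's `fnmatch`. This is only an approximation: gitignore matching has extra rules (directory-only patterns ending in `/`, anchoring with a leading `/`, negation) that `fnmatch` does not model. The helper name `is_ignored` and the sample file names are made up for the example.

```python
from fnmatch import fnmatchcase

# Three basename patterns copied from the .gitignore above.
PATTERNS = ["*.py[cod]", "*$py.class", "*.so"]


def is_ignored(basename):
    """Approximate check: does the basename match any of the glob patterns?"""
    return any(fnmatchcase(basename, pattern) for pattern in PATTERNS)


if __name__ == "__main__":
    for name in ["module.pyc", "module.pyd", "module.py", "libops.so", "Foo$py.class"]:
        print(f"{name}: {'ignored' if is_ignored(name) else 'tracked'}")
```

The bracket expression `[cod]` matches exactly one of `c`, `o`, or `d`, so `module.pyc` is ignored while `module.py` is not.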
86 changes: 86 additions & 0 deletions recommendation_v4/Dockerfile
@@ -0,0 +1,86 @@
# MI350X path — implements docs/training_recipe.md §"MI350X".

FROM rocm/primus:v26.3

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

WORKDIR /workspace/recommendation_v4

# torch / torchvision / torchaudio — training_recipe.md:38-40.
RUN pip install --upgrade --no-deps \
    --index-url https://download.pytorch.org/whl/rocm7.2 \
    torch==2.12.0+rocm7.2 \
    torchvision==0.27.0+rocm7.2 \
    torchaudio==2.11.0+rocm7.2

# torchrec — training_recipe.md:43.
RUN pip install --force-reinstall --no-deps \
    "git+https://github.com/pytorch/torchrec.git@v2026.06.01.00"

# fbgemm_gpu — training_recipe.md:42. Build from FBGEMM commit 10b77573 for
# gfx950 against the replaced torch. ~30-60 min.
RUN apt-get update && apt-get install -y --no-install-recommends git build-essential && \
    rm -rf /var/lib/apt/lists/* && \
    git clone --recursive https://github.com/pytorch/FBGEMM.git /tmp/FBGEMM && \
    cd /tmp/FBGEMM && \
    git checkout 10b775730212923f65f7b78f79b6a01d80cf3c29 && \
    git submodule update --init --recursive && \
    cd fbgemm_gpu && \
    # Filter `fairscale` and the torch family from fbgemm's requirements.txt:
    # fairscale pulls a CPU torch that would clobber the +rocm7.2 wheel installed
    # above. fairscale is a distributed-training lib used by fbgemm tests, not
    # by the build itself.
    grep -v -E '^(fairscale|torch|torchvision|torchaudio)([<>=!]|$)' requirements.txt > /tmp/req.txt && \
    pip install -r /tmp/req.txt && \
    python setup.py -j 32 bdist_wheel \
        --build-target=default \
        --build-variant=rocm \
        -DHIP_ROOT_DIR=/opt/rocm \
        -DAMDGPU_TARGETS=gfx950 && \
    pip install --force-reinstall --no-deps dist/fbgemm_gpu_nightly_rocm*.whl && \
    cd / && rm -rf /tmp/FBGEMM

# polars-u64-idx — training_recipe.md:44 (mandatory; yambda-5b > 4.29 B rows).
# Remaining packages — training_recipe.md:156-159 ("Additional Python deps") plus
# `datasets` + `huggingface_hub`, which the recipe does not list but
# preprocess_public_data.py:278 imports to download yambda from HuggingFace.
RUN pip install \
        polars-u64-idx==1.33.1 \
        gin-config \
        absl-py \
        datasets \
        huggingface_hub \
        pyre-extensions \
        iopath \
        typing-inspect \
        psutil \
        tqdm \
        pyyaml \
        lightning-utilities && \
    # torchmetrics and tensordict declare `torch` as a dep; without --no-deps
    # pip pulls torch==2.12.0+cu130 from PyPI which clobbers the +rocm7.2 wheel
    # we installed above (libtorch_hip.so disappears, fbgemm_gpu fails to load).
    pip install --no-deps \
        torchmetrics==1.0.3 \
        tensordict

# mlperf_logging — required by train/mlperf_logging_utils.py for MLPerf
# compliance logs. Pinned to the Training 6.0 tag for reproducibility; --no-deps
# so pip does not resolve requirements.txt's torch/fbgemm_gpu/torchrec pins and
# clobber the +rocm7.2 wheels above.
RUN pip install --no-deps "git+https://github.com/mlcommons/logging.git@6.0.0-rc6"

# Smoke-test the 6 imports the launch script checks at
# scripts/launch_smoke_8gpu.sh:26.
RUN python -c "import torch, fbgemm_gpu, torchrec, polars, xxhash, gin; \
print('torch', torch.__version__, '| hip', getattr(torch.version, 'hip', None))"

COPY . /workspace/recommendation_v4

ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    HSTU_HAMMER_KERNEL=TRITON \
    DLRM_DATA_PATH=/data/mlperf_dlrm_v4

CMD ["bash"]
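The `grep -v -E` filter in the fbgemm_gpu build step is easy to misread, so here is a small Python sketch of the same regular expression applied to a made-up requirements list. The sample entries and the helper name `filter_requirements` are assumptions for illustration; they are not the actual contents of fbgemm's `requirements.txt`.

```python
import re

# Same pattern as the Dockerfile's grep: drop a line when it starts with one of
# these package names followed by a version operator character or end of line.
EXCLUDE = re.compile(r"^(fairscale|torch|torchvision|torchaudio)([<>=!]|$)")


def filter_requirements(lines):
    """Keep only the lines that the grep -v filter would let through."""
    return [line for line in lines if not EXCLUDE.search(line)]


# Hypothetical requirements entries, chosen to exercise the edge cases.
SAMPLE = [
    "fairscale",
    "torch>=2.0",
    "torchvision==0.27.0",
    "torchaudio",
    "torchrec",     # kept: "torch" is followed by "r", not an operator
    "numpy",
    "torch~=2.0",   # kept: "~" is not in the [<>=!] character class
]

if __name__ == "__main__":
    for line in filter_requirements(SAMPLE):
        print(line)
```

Two things stand out in this sketch. Names that merely start with `torch`, such as `torchrec`, pass through, which appears intended. A compatible-release pin written as `torch~=2.0` would also pass through, because `~` is not in the character class. Whether that matters depends on what the pinned FBGEMM commit's `requirements.txt` actually contains.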