Releases · ROCm/ATOM

31 May 03:01

valarLip

v0.1.3

bdcc62e

v0.1.3 Latest

What's Changed

CI: Verify ATOM tests on MI35x runners by @gyohuangxin in #444
[Dashboard] Add column sorting and show Total Throughput in Data tab by @ChuanLi1101 in #447
CI: reduce HF token exposure in atom-test logs by @gyohuangxin in #453
Update GLM-5.md to refer to atom-dev docker container by @dwiddows in #454
[plugin] refine full OOT validation & OOT benchmark by @zejunchen-zejun in #388
Fix: support transformers 4.57.6 and 5.2.0 for gpt-oss by @PerryZhang01 in #419
[plugin][oot] Add Kimi-K2.5 support by @gbyu-amd in #401
[FIX] wrapper fused_qk_rmsnorm to ensure correct dispatch by @gbyu-amd in #456
[plugin][triton version] align triton version with ATOM native by @zejunchen-zejun in #458
fix deepseek tp 4 mtp3 mla metadata error by @junhaha666 in #460
[dashboard] Fix column wrapping and polish UI by @carlushuang in #461
Fix GLM-5-FP8 Indexer.weights_proj GEMM crash via exclude name remapping by @thpereir in #451
[OOT Plugin] Fix qwen3.5 fp8 functionality and accuracy issue by @ganyi1996ppo in #448
[OOT][Recipe] Update qwen3next recipe for performance measurement by @ganyi1996ppo in #466
[Qwen3Next][Perf] add fused chunk split kernel for qkvzba and qkvz,ba case by @ganyi1996ppo in #457
add GLM5 and Kimi2.5 to CI per PR by @valarLip in #470
[Bugfix] remap_layer_name with kw parameter passing by @ganyi1996ppo in #469
[Enhancement] load quark gpt_oss 120b wmxfp4 afp8 by @haoyangli0109 in #445
Revert "add kernel comparison dashboard (ATOM OOT vs vLLM v0.18)" by @valarLip in #471
CI: skip model download when cache exists by @gyohuangxin in #474
Support enable_thinking False in v1/chat/completions mode by @ZhangLirong-amd in #472
[Performance] Relaxed mtp by @haoyangli0109 in #411
CI: improve atom test image and model cache handling by @gyohuangxin in #476
[feat] Make ATOM work with SGLang out-of-tree by @zhuyuhua-v in #355
fix: fix tp8 accuracy for gpt-oss by @PerryZhang01 in #449
[plugin][OOT CI] refine OOT CI/dashboard/OOT docker release by @zejunchen-zejun in #459
[plugin][OOT benchmark] set the final job for uploading data to dashboard by @zejunchen-zejun in #482
Remove watermark overlay from dashboard by @functionstackx in #481
fix(eagle): skip attn_metadata update for non-16-head models by @valarLip in #484
refactor(dashboard): redesign based on data visualization best practices by @valarLip in #492
CI: enable AMD CI monitor workflow in ATOM by @gyohuangxin in #504
fix(dashboard): unify trends chart click to showPopover by @valarLip in #498
[plugin][fix] fix kimi k2.5 weight loading by @gbyu-amd in #496
CI: add lock-protected model downloads to ATOM tests by @gyohuangxin in #509
[plugin][recipe] update kimi recipes by @gbyu-amd in #513
Update README with installation instructions by @asleepzzz in #505
support dual stream in prepare decode by @ZhangLirong-amd in #499
Fix GLM5-MXFP4 loading error by @thpereir in #493
[plugin][script] update env var for oot benchmark/test by @gbyu-amd in #516
[plugin][upgrade vLLM] upgrade vLLM to 0.19.0 commit 2a6994 by @zejunchen-zejun in #483
docs: Add model run guide and update GPU support table by @sunway513 in #520
[dashboard] support light mode by @carlushuang in #529
feat(ci): add SGLang image release and validation workflows by @zhuyuhua-v in #510
fix: remove rmsnorm allreduce fusion in gpt-oss by @PerryZhang01 in #526
Use SKILL to automatically enable Llama workloads for ATOM vLLM Plugin by @wuhuikx in #467
feat(benchmark): add Llama-3.1-70B-Instruct-FP8 to nightly benchmark by @sunway513 in #536
Add rocm-trace-lite (RTL) for GPU kernel profiling by @sunway513 in #535
CI: inject AMD_HF_TOKEN on predownload runner by @gyohuangxin in #542
add GLM-5.1-FP8 into accuracy ci by @valarLip in #519
feat: replace triton fused_rms_fp8_group_quant with HIP kernel by @valarLip in #507
[server] Refactor OpenAI server with tool calling, reasoning, and debug logging by @carlushuang in #489
CI: prefer S3 aiter wheel before artifact fallback by @gyohuangxin in #538
fix glm mtp weight by @jiayyu in #531
feat: improve profiler robustness and enable stream parallelism for decode metadata by @valarLip in #547
[plugin][dashboard] use nightly date tagged docker by @zejunchen-zejun in #503
[Qwen3.5] add gemm fusion for qwen3.5 qkvzba bf16 case by @ganyi1996ppo in #543
[UI] Scope ATOM watermark to Performance tab only by @sunway513 in #524
docs: add Hermes Agent setup guide by @carlushuang in #551
CI: Keep scheduled main runs from blocking push-triggered validation. by @gyohuangxin in #556
fix: allow MiniMax-M2.5 loading in TP1 on MI355X by @benenzhu in #558
[Qwen3.5] atom native support for qwen3.5 by @ganyi1996ppo in #517
ci: fix benchmark workflow bugs and add GLM-5.1-MXFP4 by @valarLip in #562
[recipe] ds r1 fp4 mtp3 model change by @seungrokj in #563
ci: add MiniMax-M2.5-MXFP4 to nightly benchmark and accuracy test by @valarLip in #564
[Qwen3Next/Qwen3.5] fuse gated_rmsnorm_quant by @ganyi1996ppo in #421
[Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM-ATOM by @kliuae-amd in #399
try to print server log by @jiayyu in #550
[BugFix] enable deepseek r1 fp4 by @ZLkanyo009 in #527
fix: disable gradient tracking on all nn.Parameter for inference by @valarLip in #574
fix(docker): remove pinned torchvision/torchaudio wheels for vllm and sglang by @zhuyuhua-v in #575
[Bugfix] Remove custom config class since transformers 5.2.0 already support qwen3.5 and fix bf16 loading issue by @ganyi1996ppo in #570
fix mamba blocks ref count by @jiayyu in #581
fix(dashboard): Fix SGLang benchmark workflow and integrate into dashboard by @zhuyuhua-v in #548
integrate flydsl gdr decode by @ganyi1996ppo in #568
[atom-vllm][atom-sglang][CI] build CI image on GPU machine instead of a build-only machine by @zejunchen-zejun in #561
fix: fix thresholds for nightly model by @PerryZhang01 in #584
[nightly][vllm] add GLM-5.1-FP8 to vLLM nightly coverage by @whx-sjtu in #569
Support TBO in ATOM by @ZhangLirong-amd in #515
[ATOM Test] fix file issue when finish lmeval acc test by @zejunchen-zejun in #585
mxfp4 support for qwen3.5 by @ganyi1996ppo in #576
[Feat] remove flatten in atom sglang mla like atom vllm mla by @ZLkanyo009 in #525
fix kv cache fp8 issue by @ganyi1996ppo in #588
fix: optimize attention metadata by @PerryZhang01 in #571
fix: Add dashboard_model for SGLang deepseek TP4 by @zhuyuhua-v in #595
Add Qwen3.5 FP4 to vLLM-ATOM nightly accuracy check and benchmark by @wuhuikx in #593
fix: prefix benchmark artifacts to avoid downloading multi-GB traces by @valarLip in #596
Update mori version and make tbo in nightly by @ZhangLirong-amd in #590
[dashboard] show docker info in dashboard accuracy chart by @zejunchen-zejun in #587
ci: add Qwen3.5-397B-A17B-MXFP4 to per-PR accuracy test by @valarLip in https://github.com/ROCm/ATOM/pu...

Contributors

sunway513, asleepzzz, and 40 other contributors

30 Mar 04:20

valarLip

v0.1.2

0079204

v0.1.2

What's Changed

move from private repo to ROCm by @valarLip in #1
Update logo by @carlushuang in #2
update the link to the repo by @sunway513 in #3
fix the example code cmd by @sunway513 in #5
Update README.md and pyproject.toml by @andyluo7 in #4
update readme for deepseek by @valarLip in #6
support gpt oss by @junhaha666 in #7
gpt_oss update: add fused_qk_rope_reshape_and_cache by @junhaha666 in #9
Fix server startup message to show after model is loaded by @indianspeedster in #10
fix gpt_oss accuracy drop by @junhaha666 in #12
deepseek fp4 by @junhaha666 in #8
gpt_oss: fix moe pad && use unified attention 3d for full attention decode by @junhaha666 in #15
move preprocess into threadpool avoid serial process by @HaonanWang98 in #19
[perf] add qknorm_quant fusion for DS by @gbyu-amd in #18
engine max_model_len : default is set to hf_config.max_position_embed… by @junhaha666 in #21
[perf] add qknorm_quant and ar_rmsnorm fusion for DS by @gbyu-amd in #17
reduce data for ScheduledBatch by @valarLip in #23
fix port by @valarLip in #24
Mla cache update by @junhaha666 in #20
Add license and copyright headers by @ppalaniappan-amd in #30
add ATOM_PROFILER by @amd-ruitang3 in #33
use aiter hip fused_qk_rope_concat_and_cache_mla by @junhaha666 in #31
add qwen3 moe model support by @gbyu-amd in #22
support mtp stage 1: support draft model load by @jiayyu in #39
update benchmark to inferencemax version by @HaonanWang98 in #38
update server by @valarLip in #41
refactor prepare_kv_indices by @valarLip in #43
CI: Initial ATOM CI by @gyohuangxin in #40
limit max_split_per_batch to 16 by @valarLip in #47
support block size convert by @junhaha666 in #51
fix num_kvcache_blocks error by @junhaha666 in #52
Refactor arg_utils.py by @HaonanWang98 in #53
Adapt lm-eval chat completion request. by @HaonanWang98 in #58
CI: Add Dockerfile and nightly docker release pipeline by @gyohuangxin in #46
remove global dict for request id in stream mode by @HaonanWang98 in #60
CI: Add timeout for Nightly docker release by @gyohuangxin in #62
CI: Add ROCm 7.2 preview nightly image by @gyohuangxin in #66
update readme by @valarLip in #67
update readme by @valarLip in #71
CI: Add deepseek in ATOM tests by @gyohuangxin in #64
Making BMM use fp4 weights by @omuhamma in #57
llfp4 weight scale shuffle fix by @amirumoAMD in #74
ds3.2: add one param for top_k_per_row_prefill ops by @PerryZhang01 in #77
support aiter.gemm_a4w4 api changes by @junhaha666 in #70
remove timeout for inter token latency by @HaonanWang98 in #79
Enable INT4 QR for LLFP4 by @amirumoAMD in #76
update utility by @valarLip in #81
update_server by @valarLip in #82
Update Dockerfile by @valarLip in #83
Update Dockerfile by @valarLip in #87
Graph: add param check for cuda graph capture by @PerryZhang01 in #85
CI: Temporarily split gfx942 and gfx950 in nightly docker release by @gyohuangxin in #89
CI: Temporarily split gfx942 and gfx950 in nightly docker release pushing by @gyohuangxin in #90
CI: Increase timeout when building nightly docker image by @gyohuangxin in #91
CI: Update dockerfile to use PREBUILD_KERNELS=1 by @gyohuangxin in #92
remove async eng by @HaonanWang98 in #86
Perf: save performance info in beautiful format by @PerryZhang01 in #80
CI: Update base image to rocm/pytorch:latest in ATOM tests by @gyohuangxin in #88
CI: Fix issues in nightly build pipeline by @gyohuangxin in #93
CI: skip tests when building gfx942 nightly docker image by @gyohuangxin in #96
MLA: update aiter mqa kernel by @PerryZhang01 in #95
CI: Fix node issues and use pre-download to accelerate tests by @gyohuangxin in #94
clear schedule redundant variables by @inkcherry in #100
Gpt oss triton moe by @junhaha666 in #98
CI: Fix output issues and add gsm8k accuracy tests in CI by @gyohuangxin in #73
CI: Fix issues by @gyohuangxin in #102
CI: Add gpt-oss model by @gyohuangxin in #103
feat: support pa_decode_gluon and refactor attention ops by @PerryZhang01 in #42
CI: Fix CI issues by @gyohuangxin in #104
PA: add ATOM_GPT_OSS_MODEL env for prefill attention by @PerryZhang01 in #105
CI: Add MAX_JOBS when building the nightly image by @gyohuangxin in #106
Update docker-release.yaml by @gyohuangxin in #107
[Perf][Qwen3] Enable qknorm_rope_cache_quant fusion by @gbyu-amd in #65
[fix] fix assert for Qwen3 by @gbyu-amd in #108
DeepSeek v3.2: add sparse prefill mla and fix indexer rope by @junhaha666 in #109
[CI] add Qwen3-235B-A22B-Instruct-2507-FP8 to CI by @gbyu-amd in #110
Update Dockerfile to put aiter/atom under dir /app by @valarLip in #112
adapt for optimized ps_gluon_pa by @Bernard-Liu in #117
fuse rmsnorm + quant for llama fp8 by @scxiao in #56
code cleanup by @valarLip in #120
fix deepseek accuracy when ENABLE_DS_QKNORM_QUANT_FUSION=1 by @junhaha666 in #121
Update Dockerfile to install latest RCCL by @valarLip in #123
Update atom_test.sh by @valarLip in #122
CI: Enhance the docker release pipeline by @gyohuangxin in #125
llfp4 fail temporary workaround by @amirumoAMD in #75
CI: Fix the docker release pipeline by @gyohuangxin in #131
Fix torch 2.9 rlock error in torch compile by @ZhangLirong-amd in #114
CI: Fix CI issues by @gyohuangxin in #135
adapt for upstream gluon pa by @Bernard-Liu in #137
[fix] fix gluon pa with bf16 kv by @gbyu-amd in #124
CI: Speed up CI by using a nightly image instead of rebuilding each time by @gyohuangxin in #136
[recipe] Add qwen3 235b recipe by @gbyu-amd in #111
Fix defer output for conc>max_num_seqs by @valarLip in #134
CI: Collect Accuracy tests summary by @gyohuangxin in #132
[Triton] DS FP4/FP8 Triton fusion and GEMM optimization by @k50112113 in #119
Fix DP issues in benchmark and support Mori in Moe by @ZhangLirong-amd in #72
re-enable ATOM_ENABLE_DS_QKNORM_QUANT_FUSION regardless of ATOM_USE_T… by @k50112113 in #139
fuse rmsnorm/quant and act_mul/quant for mxfp4 llama70B by @scxiao in #129
use ck mha instead of triton unified_attention for sink and window by @junhaha666 in #118
Fix attention mha logic error by @ZhangLirong-amd in #141
CI: Add gpt-oss-120b 2 GPUs test by @gyohuangxin in #143
shuffle_weights_update by @valarLip in #144
Add the external facing doc draft for review by @ChuanLi1101 in #99
CI: Re-enable dual-arch builds in the Docker nightly releases by @gyohuangxin...