Releases: ROCm/ATOM
v0.1.3
What's Changed
- CI: Verify ATOM tests on MI35x runners by @gyohuangxin in #444
- [Dashboard] Add column sorting and show Total Throughput in Data tab by @ChuanLi1101 in #447
- CI: reduce HF token exposure in atom-test logs by @gyohuangxin in #453
- Update GLM-5.md to refer to atom-dev docker container by @dwiddows in #454
- [plugin] refine full OOT validation & OOT benchmark by @zejunchen-zejun in #388
- Fix: support transformers 4.57.6 and 5.2.0 for gpt-oss by @PerryZhang01 in #419
- [plugin][oot] Add Kimi-K2.5 support by @gbyu-amd in #401
- [FIX] wrapper fused_qk_rmsnorm to ensure correct dispatch by @gbyu-amd in #456
- [plugin][triton version] align triton version with ATOM native by @zejunchen-zejun in #458
- fix deepseek tp 4 mtp3 mla metadata error by @junhaha666 in #460
- [dashboard] Fix column wrapping and polish UI by @carlushuang in #461
- Fix GLM-5-FP8 Indexer.weights_proj GEMM crash via exclude name remapping by @thpereir in #451
- [OOT Plugin] Fix qwen3.5 fp8 functionality and accuracy issue by @ganyi1996ppo in #448
- [OOT][Recipe] Update qwen3next recipe for performance measurement by @ganyi1996ppo in #466
- [Qwen3Next][Perf] add fused chunk split kernel for qkvzba and qkvz,ba case by @ganyi1996ppo in #457
- add GLM5 and Kimi2.5 to CI per PR by @valarLip in #470
- [Bugfix] remap_layer_name with kw parameter passing by @ganyi1996ppo in #469
- [Enhancement] load quark gpt_oss 120b wmxfp4 afp8 by @haoyangli0109 in #445
- Revert "add kernel comparison dashboard (ATOM OOT vs vLLM v0.18)" by @valarLip in #471
- CI: skip model download when cache exists by @gyohuangxin in #474
- Support enable_thinking False in v1/chat/completions mode by @ZhangLirong-amd in #472
- [Performance] Relaxed mtp by @haoyangli0109 in #411
- CI: improve atom test image and model cache handling by @gyohuangxin in #476
- [feat] Make ATOM work with SGLang out-of-tree by @zhuyuhua-v in #355
- fix: fix tp8 accuracy for gpt-oss by @PerryZhang01 in #449
- [plugin][OOT CI] refine OOT CI/dashboard/OOT docker release by @zejunchen-zejun in #459
- [plugin][OOT benchmark] set the final job for uploading data to dashboard by @zejunchen-zejun in #482
- Remove watermark overlay from dashboard by @functionstackx in #481
- fix(eagle): skip attn_metadata update for non-16-head models by @valarLip in #484
- refactor(dashboard): redesign based on data visualization best practices by @valarLip in #492
- CI: enable AMD CI monitor workflow in ATOM by @gyohuangxin in #504
- fix(dashboard): unify trends chart click to showPopover by @valarLip in #498
- [plugin][fix] fix kimi k2.5 weight loading by @gbyu-amd in #496
- CI: add lock-protected model downloads to ATOM tests by @gyohuangxin in #509
- [plugin][recipe] update kimi recipes by @gbyu-amd in #513
- Update README with installation instructions by @asleepzzz in #505
- support dual stream in prepare decode by @ZhangLirong-amd in #499
- Fix GLM5-MXFP4 loading error by @thpereir in #493
- [plugin][script] update env var for oot benchmark/test by @gbyu-amd in #516
- [plugin][upgrade vLLM] upgrade vLLM to 0.19.0 commit 2a6994 by @zejunchen-zejun in #483
- docs: Add model run guide and update GPU support table by @sunway513 in #520
- [dashboard] support light mode by @carlushuang in #529
- feat(ci): add SGLang image release and validation workflows by @zhuyuhua-v in #510
- fix: remove rmsnorm allreduce fusion in gpt-oss by @PerryZhang01 in #526
- Use SKILL to automatically enable Llama workloads for ATOM vLLM Plugin by @wuhuikx in #467
- feat(benchmark): add Llama-3.1-70B-Instruct-FP8 to nightly benchmark by @sunway513 in #536
- Add rocm-trace-lite (RTL) for GPU kernel profiling by @sunway513 in #535
- CI: inject AMD_HF_TOKEN on predownload runner by @gyohuangxin in #542
- add GLM-5.1-FP8 into accuracy ci by @valarLip in #519
- feat: replace triton fused_rms_fp8_group_quant with HIP kernel by @valarLip in #507
- [server] Refactor OpenAI server with tool calling, reasoning, and debug logging by @carlushuang in #489
- CI: prefer S3 aiter wheel before artifact fallback by @gyohuangxin in #538
- fix glm mtp weight by @jiayyu in #531
- feat: improve profiler robustness and enable stream parallelism for decode metadata by @valarLip in #547
- [plugin][dashboard] use nightly date tagged docker by @zejunchen-zejun in #503
- [Qwen3.5] add gemm fusion for qwen3.5 qkvzba bf16 case by @ganyi1996ppo in #543
- [UI] Scope ATOM watermark to Performance tab only by @sunway513 in #524
- docs: add Hermes Agent setup guide by @carlushuang in #551
- CI: Keep scheduled main runs from blocking push-triggered validation. by @gyohuangxin in #556
- fix: allow MiniMax-M2.5 loading in TP1 on MI355X by @benenzhu in #558
- [Qwen3.5] atom native support for qwen3.5 by @ganyi1996ppo in #517
- ci: fix benchmark workflow bugs and add GLM-5.1-MXFP4 by @valarLip in #562
- [recipe] ds r1 fp4 mtp3 model change by @seungrokj in #563
- ci: add MiniMax-M2.5-MXFP4 to nightly benchmark and accuracy test by @valarLip in #564
- [Qwen3Next/Qwen3.5] fuse gated_rmsnorm_quant by @ganyi1996ppo in #421
- [Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM-ATOM by @kliuae-amd in #399
- try to print server log by @jiayyu in #550
- [BugFix] enable deepseek r1 fp4 by @ZLkanyo009 in #527
- fix: disable gradient tracking on all nn.Parameter for inference by @valarLip in #574
- fix(docker): remove pinned torchvision/torchaudio wheels for vllm and sglang by @zhuyuhua-v in #575
- [Bugfix] Remove custom config class since transformers 5.2.0 already supports qwen3.5 and fix bf16 loading issue by @ganyi1996ppo in #570
- fix mamba blocks ref count by @jiayyu in #581
- fix(dashboard): Fix SGLang benchmark workflow and integrate into dashboard by @zhuyuhua-v in #548
- integrate flydsl gdr decode by @ganyi1996ppo in #568
- [atom-vllm][atom-sglang][CI] build CI image on GPU machine instead of a build-only machine by @zejunchen-zejun in #561
- fix: fix thresholds for nightly model by @PerryZhang01 in #584
- [nightly][vllm] add GLM-5.1-FP8 to vLLM nightly coverage by @whx-sjtu in #569
- Support TBO in ATOM by @ZhangLirong-amd in #515
- [ATOM Test] fix file issue when finish lmeval acc test by @zejunchen-zejun in #585
- mxfp4 support for qwen3.5 by @ganyi1996ppo in #576
- [Feat] remove flatten in atom sglang mla like atom vllm mla by @ZLkanyo009 in #525
- fix kv cache fp8 issue by @ganyi1996ppo in #588
- fix: optimize attention metadata by @PerryZhang01 in #571
- fix: Add dashboard_model for SGLang deepseek TP4 by @zhuyuhua-v in #595
- Add Qwen3.5 FP4 to vLLM-ATOM nightly accuracy check and benchmark by @wuhuikx in #593
- fix: prefix benchmark artifacts to avoid downloading multi-GB traces by @valarLip in #596
- Update mori version and make tbo in nightly by @ZhangLirong-amd in #590
- [dashboard] show docker info in dashboard accuracy chart by @zejunchen-zejun in #587
- ci: add Qwen3.5-397B-A17B-MXFP4 to per-PR accuracy test by @valarLip in https://github.com/ROCm/ATOM/pu...
v0.1.2
What's Changed
- move from private repo to ROCm by @valarLip in #1
- Update logo by @carlushuang in #2
- update the link to the repo by @sunway513 in #3
- fix the example code cmd by @sunway513 in #5
- Update README.md and pyproject.toml by @andyluo7 in #4
- update readme for deepseek by @valarLip in #6
- support gpt oss by @junhaha666 in #7
- gpt_oss update: add fused_qk_rope_reshape_and_cache by @junhaha666 in #9
- Fix server startup message to show after model is loaded by @indianspeedster in #10
- fix gpt_oss accuracy drop by @junhaha666 in #12
- deepseek fp4 by @junhaha666 in #8
- gpt_oss: fix moe pad && use unified attention 3d for full attention decode by @junhaha666 in #15
- move preprocess into threadpool avoid serial process by @HaonanWang98 in #19
- [perf] add qknorm_quant fusion for DS by @gbyu-amd in #18
- engine max_model_len : default is set to hf_config.max_position_embed… by @junhaha666 in #21
- [perf] add qknorm_quant and ar_rmsnorm fusion for DS by @gbyu-amd in #17
- reduce data for ScheduledBatch by @valarLip in #23
- fix port by @valarLip in #24
- MLA cache update by @junhaha666 in #20
- Add license and copyright headers by @ppalaniappan-amd in #30
- add ATOM_PROFILER by @amd-ruitang3 in #33
- use aiter hip fused_qk_rope_concat_and_cache_mla by @junhaha666 in #31
- add qwen3 moe model support by @gbyu-amd in #22
- support mtp stage 1: support draft model load by @jiayyu in #39
- update benchmark to inferencemax version by @HaonanWang98 in #38
- update server by @valarLip in #41
- refactor prepare_kv_indices by @valarLip in #43
- CI: Initial ATOM CI by @gyohuangxin in #40
- limit max_split_per_batch to 16 by @valarLip in #47
- support block size convert by @junhaha666 in #51
- fix num_kvcache_blocks error by @junhaha666 in #52
- Refactor arg_utils.py by @HaonanWang98 in #53
- Adapt lm-eval chat completion request. by @HaonanWang98 in #58
- CI: Add Dockerfile and nightly docker release pipeline by @gyohuangxin in #46
- remove global dict for request id in stream mode by @HaonanWang98 in #60
- CI: Add timeout for Nightly docker release by @gyohuangxin in #62
- CI: Add ROCm 7.2 preview nightly image by @gyohuangxin in #66
- update readme by @valarLip in #67
- update readme by @valarLip in #71
- CI: Add deepseek in ATOM tests by @gyohuangxin in #64
- Making BMM use fp4 weights by @omuhamma in #57
- llfp4 weight scale shuffle fix by @amirumoAMD in #74
- ds3.2: add one param for top_k_per_row_prefill ops by @PerryZhang01 in #77
- support aiter.gemm_a4w4 api changes by @junhaha666 in #70
- remove timeout for inter token latency by @HaonanWang98 in #79
- Enable INT4 QR for LLFP4 by @amirumoAMD in #76
- update utility by @valarLip in #81
- update_server by @valarLip in #82
- Update Dockerfile by @valarLip in #83
- Update Dockerfile by @valarLip in #87
- Graph: add param check for cuda graph capture by @PerryZhang01 in #85
- CI: Temporarily split gfx942 and gfx950 in nightly docker release by @gyohuangxin in #89
- CI: Temporarily split gfx942 and gfx950 in nightly docker release pushing by @gyohuangxin in #90
- CI: Increase timeout when building nightly docker image by @gyohuangxin in #91
- CI: Update dockerfile to use PREBUILD_KERNELS=1 by @gyohuangxin in #92
- remove async eng by @HaonanWang98 in #86
- Perf: save performance info in beautiful format by @PerryZhang01 in #80
- CI: Update base image to rocm/pytorch:latest in ATOM tests by @gyohuangxin in #88
- CI: Fix issues in nightly build pipeline by @gyohuangxin in #93
- CI: skip tests when building gfx942 nightly docker image by @gyohuangxin in #96
- MLA: update aiter mqa kernel by @PerryZhang01 in #95
- CI: Fix node issues and use pre-download to accelerate tests by @gyohuangxin in #94
- clear schedule redundant variables by @inkcherry in #100
- Gpt oss triton moe by @junhaha666 in #98
- CI: Fix output issues and add gsm8k accuracy tests in CI by @gyohuangxin in #73
- CI: Fix issues by @gyohuangxin in #102
- CI: Add gpt-oss model by @gyohuangxin in #103
- feat: support pa_decode_gluon and refactor attention ops by @PerryZhang01 in #42
- CI: Fix CI issues by @gyohuangxin in #104
- PA: add ATOM_GPT_OSS_MODEL env for prefill attention by @PerryZhang01 in #105
- CI: Add MAX_JOBS when building the nightly image by @gyohuangxin in #106
- Update docker-release.yaml by @gyohuangxin in #107
- [Perf][Qwen3] Enable qknorm_rope_cache_quant fusion by @gbyu-amd in #65
- [fix] fix assert for Qwen3 by @gbyu-amd in #108
- DeepSeek v3.2: add sparse prefill mla and fix indexer rope by @junhaha666 in #109
- [CI] add Qwen3-235B-A22B-Instruct-2507-FP8 to CI by @gbyu-amd in #110
- Update Dockerfile to put aiter/atom under dir /app by @valarLip in #112
- adapt for optimized ps_gluon_pa by @Bernard-Liu in #117
- fuse rmsnorm + quant for llama fp8 by @scxiao in #56
- code cleanup by @valarLip in #120
- fix deepseek accuracy when ENABLE_DS_QKNORM_QUANT_FUSION=1 by @junhaha666 in #121
- Update Dockerfile to install latest RCCL by @valarLip in #123
- Update atom_test.sh by @valarLip in #122
- CI: Enhance the docker release pipeline by @gyohuangxin in #125
- llfp4 fail temporary workaround by @amirumoAMD in #75
- CI: Fix the docker release pipeline by @gyohuangxin in #131
- Fix torch 2.9 rlock error in torch compile by @ZhangLirong-amd in #114
- CI: Fix CI issues by @gyohuangxin in #135
- adapt for upstream gluon pa by @Bernard-Liu in #137
- [fix] fix gluon pa with bf16 kv by @gbyu-amd in #124
- CI: Speed up CI by using a nightly image instead of rebuilding each time by @gyohuangxin in #136
- [recipe] Add qwen3 235b recipe by @gbyu-amd in #111
- Fix defer output for conc>max_num_seqs by @valarLip in #134
- CI: Collect Accuracy tests summary by @gyohuangxin in #132
- [Triton] DS FP4/FP8 Triton fusion and GEMM optimization by @k50112113 in #119
- Fix DP issues in benchmark and support Mori in Moe by @ZhangLirong-amd in #72
- re-enable ATOM_ENABLE_DS_QKNORM_QUANT_FUSION regardless of ATOM_USE_T… by @k50112113 in #139
- fuse rmsnorm/quant and act_mul/quant for mxfp4 llama70B by @scxiao in #129
- use ck mha instead of triton unified_attention for sink and window by @junhaha666 in #118
- Fix attention mha logic error by @ZhangLirong-amd in #141
- CI: Add gpt-oss-120b 2 GPUs test by @gyohuangxin in #143
- shuffle_weights_update by @valarLip in #144
- Add the external facing doc draft for review by @ChuanLi1101 in #99
- CI: Re-enable dual-arch builds in the Docker nightly releases by @gyohuangxin...