[CI] Fix `gpt_dynamic_inference_tp2_pp2_ep2_gptoss_20b_swa` tests by asolergi-nv · Pull Request #5527 · NVIDIA/Megatron-LM

asolergi-nv · 2026-06-28T08:03:46Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

Summary

The functional test gpt_dynamic_inference_tp2_pp2_ep2_gptoss_20b_swa has never passed on main since it was introduced. It crashes during model construction with:

AttributeError: 'TransformerConfig' object has no attribute 'yarn_rotary_scaling_factor'

File "megatron/core/models/gpt/gpt_model.py", line 186, in __init__
    scaling_factor=getattr(self.config, "yarn_rotary_scaling_factor"),

This PR restores the YaRN configuration so GPT-OSS dynamic inference builds and runs again, and places the wiring at the shared config-construction chokepoint so a future builder refactor cannot silently drop it again.

The problem

GPT-OSS uses YaRN RoPE. GPTModel.__init__ (the position_embedding_type == 'yarn' branch) and yarn_rotary_pos_embedding read seven hyperparameters off the config as dynamic attributes, with no default:

scaling_factor=getattr(self.config, "yarn_rotary_scaling_factor"),
original_max_position_embeddings=getattr(self.config, "yarn_original_max_position_embeddings"),
beta_fast=getattr(self.config, "yarn_beta_fast"),
...

These yarn_* names are not declared as fields on TransformerConfig (only MLATransformerConfig declares the unprefixed equivalents — rotary_scaling_factor, mscale, beta_fast, …). So unless some caller sets config.yarn_* explicitly, the first getattr raises AttributeError.
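The failure mode described above can be reproduced in isolation. The following is a minimal sketch, assuming a toy dataclass (ToyConfig is hypothetical and is not the real TransformerConfig); it only shows that a two-argument getattr raises for an undeclared attribute and succeeds once a caller sets it.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Stand-in for a config class; NOT the real TransformerConfig."""

    hidden_size: int = 64


config = ToyConfig()

# Two-argument getattr behaves like plain attribute access: it raises when the
# attribute was never declared as a field or set on the instance.
try:
    getattr(config, "yarn_rotary_scaling_factor")
    raised = False
except AttributeError:
    raised = True

# Once some caller sets the attribute dynamically, the same lookup succeeds.
config.yarn_rotary_scaling_factor = 32.0  # illustrative value only
value = getattr(config, "yarn_rotary_scaling_factor")
```

A three-argument getattr(config, name, default) would avoid the exception, but it would also hide a missing configuration value, which is a different trade-off from the one this PR makes.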

The dynamic inference path builds its config through GPTModelBuilder(gpt_config_from_args(args)) → core_transformer_config_from_args, which only copies args that are declared dataclass fields of the config class. None of the yarn_* attributes are fields, so they were never set — and the bare getattr crashed.
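To illustrate why the yarn_* values never reached the config, here is a hedged sketch of a field-filtered copy. It assumes the filtering works roughly like this; toy_config_from_args and ToyConfig are hypothetical names and do not reproduce core_transformer_config_from_args.

```python
import dataclasses
from argparse import Namespace


@dataclasses.dataclass
class ToyConfig:
    """Stand-in for a config class; NOT the real TransformerConfig."""

    hidden_size: int = 64
    position_embedding_type: str = "rope"


def toy_config_from_args(args: Namespace, config_class=ToyConfig):
    """Copy only the args whose names match a declared dataclass field."""
    field_names = {f.name for f in dataclasses.fields(config_class)}
    kwargs = {k: v for k, v in vars(args).items() if k in field_names}
    return config_class(**kwargs)


args = Namespace(
    hidden_size=128,
    position_embedding_type="yarn",
    yarn_beta_fast=32.0,  # not a declared field, so it is silently dropped
)
config = toy_config_from_args(args)
```

In this sketch the config selects YaRN but carries no yarn_beta_fast attribute, which is the state the crash above starts from.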

Why it happened — a clean-merge / semantic conflict between two PRs

The YaRN-for-GPT-OSS feature reads config.yarn_*, but the introducing change (ADLR/megatron-lm!4044, "YaRN support for gpt-oss") only ever set those attributes in two places: the ModelOpt/export builder (megatron/post_training/model_builder.py, gated on --enable-gpt-oss) and, later, the legacy gpt_builder callable.

The functional test arrived in PR #5249 ("Support SWA and sink attention in dynamic inference", merged 2026-06-24 04:02 UTC). That PR wired YaRN into the config inside gpt_builders.py:gpt_builder via a helper _apply_yarn_config_from_args (commit dfdb9e8: "apply YaRN config in gpt_builders when running inference"). At that time the inference entry point routed through gpt_builder, so the attributes were set and the author's smoke test legitimately generated tokens:

generated tokens: [623, 1825, 10648, 1606, 290, 2461, 50005, 4580]
generated text:   " The user wants only the run-time"

About ten hours later, PR #5169 ("Add inference functions… and remove legacy modelbuilder functions", merged 2026-06-24 14:04 UTC) refactored megatron/inference/utils.py:

-from gpt_builders import gpt_builder
-def get_model_for_inference() -> MegatronModule:
-        model_builder = gpt_builder
+        builder = get_model_builder(args)            # -> GPTModelBuilder(gpt_config_from_args(args))
+        model = builder.build_distributed_models(...)

The new GPTModelBuilder.build_model builds GPTModel directly from config.transformer and never calls _apply_yarn_config_from_args. The two PRs touch different files, so git merged them with no textual conflict — but the YaRN wiring that #5249's test depends on now lives in a function (gpt_builder) that the inference path no longer calls. Neither PR's CI caught it: #5249 passed because the legacy path still set the attributes; #5169 predated the new test and so removing the legacy builder broke nothing it tested. The break only materialized once both were in main.

A secondary latent issue: the original _apply_yarn_config_from_args skipped any arg whose value was None, leaving the attribute unset — so even on the legacy path, omitting any --yarn-* flag would reintroduce the same AttributeError.
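The None-skip gap can be shown with a small sketch. It assumes the original helper behaved as described in the paragraph above; apply_skipping_none is a hypothetical stand-in and not the code from gpt_builders.py.

```python
from types import SimpleNamespace


def apply_skipping_none(config, args, names):
    """Hypothetical helper: copies a value only when the arg is not None."""
    for name in names:
        value = getattr(args, name, None)
        if value is None:
            continue  # the attribute stays unset on the config
        setattr(config, name, value)


config = SimpleNamespace()
# One flag supplied, one omitted (argparse-style None).
args = SimpleNamespace(yarn_beta_fast=32.0, yarn_beta_slow=None)
apply_skipping_none(config, args, ["yarn_beta_fast", "yarn_beta_slow"])
```

After this call config.yarn_beta_slow does not exist at all, so a later bare getattr on it would raise the same AttributeError.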

The fix

Move the YaRN-from-args wiring out of the now-vestigial gpt_builder and into the shared config-construction chokepoint, core_transformer_config_from_args in megatron/training/argument_utils.py. Both the legacy gpt_builder and the new GPTModelBuilder (via gpt_config_from_args) build their config through this function, so both paths are covered by a single source of truth.

The helper:

- Runs only when position_embedding_type == 'yarn' and the model is not MLA (MLA declares the unprefixed fields and consumes them in its own attention path).
- Maps the unprefixed CLI flags to the prefixed config attributes (--rotary-scaling-factor → yarn_rotary_scaling_factor, --mscale → yarn_mscale, --mscale-all-dim → yarn_mscale_all_dim); the remaining --yarn-* flags map by name.
- Guarantees the attributes always exist when YaRN is selected, falling back to YarnRotaryEmbedding's documented defaults when a flag is omitted (original_max_position_embeddings=4096, beta_fast=32.0, beta_slow=1.0, correction_range_round_to_int=True). This closes the None-skip gap so a missing flag degrades gracefully instead of raising AttributeError.
- Preserves any value already present on the config (e.g. from YAML or the ModelOpt GPT-OSS builder), so existing paths are unaffected.
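The four behaviours listed above can be sketched as follows. This is an illustration under stated assumptions, not the helper added by this PR: the function name apply_yarn_config_sketch, the args attribute names (including multi_latent_attention as the MLA switch), and the example values are assumptions. Only the flag-to-attribute mapping and the four fallback values come from the description.

```python
from types import SimpleNamespace

# Fallbacks quoted in the PR description for flags that were omitted.
YARN_DEFAULTS = {
    "yarn_original_max_position_embeddings": 4096,
    "yarn_beta_fast": 32.0,
    "yarn_beta_slow": 1.0,
    "yarn_correction_range_round_to_int": True,
}

# Unprefixed arg name -> prefixed config attribute, per the PR description.
UNPREFIXED_TO_YARN = {
    "rotary_scaling_factor": "yarn_rotary_scaling_factor",
    "mscale": "yarn_mscale",
    "mscale_all_dim": "yarn_mscale_all_dim",
}


def apply_yarn_config_sketch(config, args):
    """Illustrative only; not the helper added by this PR."""
    if getattr(args, "position_embedding_type", None) != "yarn":
        return
    if getattr(args, "multi_latent_attention", False):  # assumed arg name
        return

    # Source arg for every target attribute: three renamed, four by name.
    sources = {target: arg for arg, target in UNPREFIXED_TO_YARN.items()}
    sources.update({name: name for name in YARN_DEFAULTS})

    for target, arg_name in sources.items():
        if hasattr(config, target):
            continue  # keep values set earlier, e.g. from YAML
        value = getattr(args, arg_name, None)
        if value is None:
            # The description gives no fallback for the three renamed flags,
            # so this sketch stores None for them instead of guessing.
            value = YARN_DEFAULTS.get(target)
        setattr(config, target, value)


# Usage with illustrative values: one attribute pre-set, several flags omitted.
config = SimpleNamespace(yarn_beta_fast=16.0)
args = SimpleNamespace(
    position_embedding_type="yarn",
    rotary_scaling_factor=32.0,
    mscale=1.0,
    mscale_all_dim=0.0,
    yarn_beta_fast=None,
    yarn_beta_slow=None,
)
apply_yarn_config_sketch(config, args)

# A non-YaRN model is left untouched.
rope_config = SimpleNamespace()
apply_yarn_config_sketch(rope_config, SimpleNamespace(position_embedding_type="rope"))
```

In this sketch the pre-set yarn_beta_fast survives, the omitted flags receive the quoted fallbacks, and every yarn_* attribute exists afterwards, so the bare getattr calls in the model constructor would no longer raise.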

The redundant _apply_yarn_config_from_args helper and its call site are removed from gpt_builders.py.

Files changed

- megatron/training/argument_utils.py: add _apply_yarn_config_from_args and call it from core_transformer_config_from_args.
- gpt_builders.py: remove the duplicate helper and its call (now centralized).

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

Signed-off-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

copy-pr-bot · 2026-06-28T08:03:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-28T08:03:59Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

asolergi-nv · 2026-06-28T08:04:13Z

/claude review

claude

LGTM

Fix gl ci

7862f09

Signed-off-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

asolergi-nv requested review from a team as code owners June 28, 2026 08:03

svcnvidia-nemo-ci marked this pull request as draft June 28, 2026 08:03

asolergi-nv requested review from cuichenx and shanmugamr1992 June 28, 2026 08:04

asolergi-nv marked this pull request as ready for review June 28, 2026 08:04

svcnvidia-nemo-ci added the complexity: low label Jun 28, 2026

claude Bot approved these changes Jun 28, 2026

asolergi-nv wants to merge 1 commit into NVIDIA:main from asolergi-nv:fix-gl-ci


2 participants
