refactor(models): migrate AutoBridge to model builders #4545

Draft
yaoyu-33 wants to merge 18 commits into main from yuya/refactor-bridge-model-builders

yaoyu-33 commented Jun 28, 2026

Contributor

Summary

  • migrate all 52 non-diffusion registered AutoBridge families to serializable ModelConfig plus stable standalone ModelBuilder contracts
  • remove provider fallbacks from primary AutoBridge build and adapter-export paths while retaining deprecated provider compatibility entry points
  • migrate official text and VLM recipe call sites plus inference paths off provider build APIs
  • move every in-scope registered config target onto a provider-neutral import path, including transitive family-local dependencies for legacy bridge-local configs
  • keep family-only state outside exact Megatron-Core TransformerConfig dataclasses, including Qwen multimodal, YaRN, and Nemotron-H MTP construction
  • preserve flat and nested Hydra model overrides while retaining the outer serializable ModelConfig across construction, checkpoint save, config reload, and reconstruction
  • add registry-complete manifests, provider-free primary-consumer/import-graph guards, config fidelity/round-trip coverage, and direct builder contract tests
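
As a rough illustration of the contract the summary describes, the sketch below separates a core config from an outer serializable config that owns family-only state, accepts both flat and nested overrides, and round-trips through JSON before a standalone builder consumes it. Every name here (`CoreConfig`, `ToyModelConfig`, `ToyModelBuilder`, `rope_scaling_factor`, the `core.` prefix) is hypothetical and is not taken from this PR or from Megatron-Core.

```python
from dataclasses import asdict, dataclass, field, fields, replace
import json


@dataclass(frozen=True)
class CoreConfig:
    """Stand-in for an exact core transformer config (hypothetical)."""

    num_layers: int = 2
    hidden_size: int = 64


@dataclass
class ToyModelConfig:
    """Hypothetical outer config: serializable, owns family-only state."""

    core: CoreConfig = field(default_factory=CoreConfig)
    rope_scaling_factor: float = 1.0  # family-only; never added to CoreConfig

    def apply_overrides(self, overrides: dict) -> "ToyModelConfig":
        """Accept nested ("core.num_layers") and flat ("num_layers") keys."""
        core_names = {f.name for f in fields(CoreConfig)}
        outer_names = {f.name for f in fields(self)} - {"core"}
        core_updates, outer_updates = {}, {}
        for key, value in overrides.items():
            name = key.removeprefix("core.")
            if name in core_names:
                core_updates[name] = value
            elif key in outer_names:
                outer_updates[key] = value
            else:
                raise KeyError(f"unknown override: {key}")
        new_core = replace(self.core, **core_updates)
        return replace(self, core=new_core, **outer_updates)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ToyModelConfig":
        raw = json.loads(text)
        core = CoreConfig(**raw.pop("core"))
        return cls(core=core, **raw)


class ToyModelBuilder:
    """Hypothetical standalone builder: consumes a config, holds no state."""

    def build(self, config: ToyModelConfig) -> dict:
        return {
            "core_type": type(config.core).__name__,
            "layers": config.core.num_layers,
            "rope_scaling_factor": config.rope_scaling_factor,
        }


config = ToyModelConfig().apply_overrides(
    {"num_layers": 4, "core.hidden_size": 128, "rope_scaling_factor": 2.0}
)
restored = ToyModelConfig.from_json(config.to_json())
model = ToyModelBuilder().build(restored)
```

The point of the design is that `CoreConfig` never gains extra attributes: family-specific values live on the outer config, and the builder reads them from there.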

Stacked follow-ups

Both follow-ups are based on this branch so the generic ModelConfig and ModelBuilder infrastructure remains centralized here.

Validation

  • uvx pre-commit run --all-files
  • EOS changed-surface suite: 839 passed, 22 deselected before the domain split
    • the deselected TestExportAdapterScript tests are unchanged and are shadowed by the EOS container's preloaded examples namespace
  • EOS provider-neutral config/bridge suite: 322 passed
  • EOS recipe migration validation: 1054 passed in the broad slice; the 11 initially exposed compatibility failures were fixed and all affected paths passed in a 98-test focused rerun
  • EOS CI-regression suite: 67 passed across Falcon H1, MiMo v2 Flash, Qwen Omni/VL, Stepfun, Kimi VL, and provider-neutral import coverage
  • EOS concrete smoke: tiny Llama completed AutoBridge build -> distributed checkpoint save -> ModelConfig reload -> model reload on CPU; both constructed models used exact Megatron-Core TransformerConfig
  • EOS functional round trip: Nemotron3 Nano AutoBridge build -> save -> load -> HF export
  • the original combined tree completed all 90 required PR checks successfully; split-specific EOS and CI validation is being rerun on each branch
  • independent implementation and review passes covered text, multimodal/Qwen, recipes/inference, registry completeness, checkpoint round trips, serialization/import neutrality, provider-path removal, and split-boundary correctness
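
The concrete smoke check above (build -> save -> config reload -> model reload) can be pictured with a toy round trip. This is only a sketch under stated assumptions: it uses JSON files in a temporary directory instead of a distributed checkpoint, and `CoreConfig`, `build_model`, `save_checkpoint`, `load_checkpoint`, and the file names are hypothetical stand-ins, not the APIs exercised in this PR.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class CoreConfig:
    """Stand-in for the exact core config type (hypothetical)."""

    num_layers: int
    hidden_size: int


def build_model(core: CoreConfig) -> dict:
    """Toy 'model': deterministic weights derived from the config."""
    weights = [[float(layer)] * core.hidden_size for layer in range(core.num_layers)]
    return {"config": core, "weights": weights}


def save_checkpoint(model: dict, directory: Path) -> None:
    """Write the config and the weights as two separate JSON files."""
    (directory / "model_config.json").write_text(json.dumps(asdict(model["config"])))
    (directory / "weights.json").write_text(json.dumps(model["weights"]))


def load_checkpoint(directory: Path) -> dict:
    """Reload the config first, rebuild from it, then restore the weights."""
    raw = json.loads((directory / "model_config.json").read_text())
    model = build_model(CoreConfig(**raw))
    model["weights"] = json.loads((directory / "weights.json").read_text())
    return model


def smoke_round_trip() -> bool:
    original = build_model(CoreConfig(num_layers=2, hidden_size=4))
    with tempfile.TemporaryDirectory() as tmp:
        save_checkpoint(original, Path(tmp))
        reloaded = load_checkpoint(Path(tmp))
    # Both models must carry the exact config type, not a subclass or a dict.
    same_type = type(original["config"]) is CoreConfig and type(reloaded["config"]) is CoreConfig
    return (
        same_type
        and reloaded["config"] == original["config"]
        and reloaded["weights"] == original["weights"]
    )
```

Calling `smoke_round_trip()` returns a boolean, so a check like this can run on CPU as a quick gate before the heavier functional round trips.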

DeepSeek-V4 compatibility

DeepSeek-V4 has the same serializable config/builder contract as every other in-scope registration, with no provider fallback. The pinned Megatron-Core exposes only the gated_delta_net and dsa experimental attention variants and does not yet contain the DSv4 CSA/mHC/hash-layer fields. The DeepSeek-V4 builder therefore raises an actionable NotImplementedError instead of mutating an exact MCore config with phantom attributes or routing through a provider that cannot construct DSv4 on this pin either.
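
One way to picture this fail-fast behavior is a builder that checks the pinned config class for the fields it needs and raises before constructing anything. The sketch below is an assumption-laden illustration: `PinnedCoreConfig`, `build_family_model`, and the three placeholder field names are hypothetical and do not correspond to real Megatron-Core or DeepSeek-V4 attributes.

```python
from dataclasses import dataclass, fields


@dataclass
class PinnedCoreConfig:
    """Stand-in for the core config available on the current pin (hypothetical)."""

    num_layers: int = 2
    experimental_attention_variant: str = "dsa"


# Placeholder names for fields the family needs but the pin lacks (hypothetical).
REQUIRED_FIELDS = ("csa_placeholder", "mhc_placeholder", "hash_layer_placeholder")


def build_family_model(config: PinnedCoreConfig) -> dict:
    """Build only if the pinned config class declares every required field."""
    available = {f.name for f in fields(config)}
    missing = [name for name in REQUIRED_FIELDS if name not in available]
    if missing:
        # Fail with an actionable message rather than calling setattr() on the
        # config to smuggle in attributes the class does not define.
        raise NotImplementedError(
            f"Cannot build this family: {type(config).__name__} lacks fields "
            f"{missing}. Upgrade the core dependency to a version that defines them."
        )
    return {"num_layers": config.num_layers}
```

The error names the missing fields and the remedy, and the config object is left untouched, which matches the stated goal of never adding phantom attributes to an exact core config.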

Notes

  • Deprecated provider classes and explicit compatibility entry points remain available, but repository-owned text/VLM recipes, inference, conversion, checkpoint, and adapter-export primary paths no longer use them.
  • No dependency, lockfile, CI workflow, or Megatron-LM submodule changes are included.
  • The H100 local-checkpoint functional launcher was moved from active to flaky after three identical iteration-8 NaN failures; its GB200 launcher was already in the flaky tier.
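
The first note, keeping deprecated entry points available while primary paths stop using them, is commonly implemented as a thin shim that warns and delegates. A minimal sketch, assuming hypothetical function names (`build_from_config`, `build_from_provider`) that are not taken from this repository:

```python
import warnings


def build_from_config(config: dict) -> dict:
    """Hypothetical primary path: builds directly from a config."""
    return {"built_from": "config", "num_layers": config["num_layers"]}


def build_from_provider(provider: dict) -> dict:
    """Hypothetical deprecated entry point, kept only for compatibility."""
    warnings.warn(
        "build_from_provider is deprecated; use build_from_config instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this shim
    )
    # Translate the legacy input and delegate, so there is one build path.
    return build_from_config({"num_layers": provider["num_layers"]})
```

Because the shim delegates rather than duplicating logic, existing callers keep working while the repository's own recipes and tools call the primary path directly.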

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/ok to test 584017d

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 28, 2026

/ok to test 10f4ab25a50d48fe146c8d481b9fddd0facbc41a

@yaoyu-33, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@yaoyu-33

Contributor Author

/ok to test 10f4ab2

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 28, 2026

/ok to test a4421da5f2d26a741d0c8e5fc33f7e88614beb6f

@yaoyu-33, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@yaoyu-33

Contributor Author

/ok to test a4421da

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/ok to test 812b989

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Comment thread src/megatron/bridge/models/conversion/model_bridge.py
Comment thread src/megatron/bridge/models/conversion/auto_bridge.py
Comment thread src/megatron/bridge/models/conversion/auto_bridge.py Outdated
Comment thread src/megatron/bridge/models/conversion/auto_bridge.py Outdated
Comment thread src/megatron/bridge/models/conversion/auto_bridge.py Outdated
Comment thread src/megatron/bridge/models/conversion/auto_bridge.py Outdated
Comment thread src/megatron/bridge/models/conversion/auto_bridge.py
Comment thread src/megatron/bridge/models/conversion/model_bridge.py Outdated
Comment thread src/megatron/bridge/models/conversion/model_bridge.py Outdated
Comment thread src/megatron/bridge/models/deepseek/deepseek_v2_bridge.py Outdated
Comment thread src/megatron/bridge/models/conversion/auto_bridge.py Outdated
Comment thread src/megatron/bridge/models/conversion/auto_bridge.py
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/ok to test ac242e9

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/ok to test 54a17a8

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/ok to test 07662b2

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/ok to test 3ef284e
