docs: fix README drift + add consistency test (NVBug 6366190) #4488

Open
pruprakash wants to merge 2 commits into main from pruprakash/fix_docs_readme_drift_6366190

Conversation

@pruprakash

Summary

Fixes NVBug 6366190: 6 documentation drifts across 4 Megatron-Bridge README/tutorial files, aligned with current source, plus a regression test so the drift can't silently return.

Changes

- scripts/training/README.md: gpt_126m_pretrain_config → vanilla_gpt_pretrain_config; nonexistent qwen25_vl_pretrain_config → qwen25_vl_7b_sft_config; document the --hf_path flag.
- tutorials/recipes/llama/README.md: conversion path ../../conversion → examples/conversion/convert_checkpoints.py; GPTDatasetConfig field seq_length → sequence_length (and matching comment).
- tutorials/data/dclm/README.md: Megatron-LM/tools/preprocess_data.py → 3rdparty/Megatron-LM/tools/preprocess_data.py.
- tests/unit_tests/docs/test_readme_consistency.py: new stdlib-only test asserting documented recipes/paths/fields/flags match source.
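For readers unfamiliar with this kind of check, here is a minimal sketch of what a stdlib-only README consistency test could look like. It is not the test from this PR: the helper names (`documented_configs`, `defined_functions`, `missing_from_source`) and the `*_config` naming pattern are assumptions made for illustration, and the sketch works on in-memory strings rather than the real repository files.

```python
import ast
import re

# Assumption: recipe names in the README are identifiers ending in "_config".
CONFIG_NAME = re.compile(r"\b[a-z0-9_]+_config\b")


def documented_configs(readme_text: str) -> set[str]:
    """Collect every *_config identifier mentioned in the README text."""
    return set(CONFIG_NAME.findall(readme_text))


def defined_functions(source_text: str) -> set[str]:
    """Collect the names of all functions defined in a Python source string."""
    tree = ast.parse(source_text)
    return {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}


def missing_from_source(readme_text: str, source_text: str) -> list[str]:
    """Return documented config names that have no matching function in source."""
    return sorted(documented_configs(readme_text) - defined_functions(source_text))
```

A test built on this would read the README and the recipe module from disk and assert that `missing_from_source(...)` returns an empty list. With a README that mentions `qwen25_vl_pretrain_config` and a source module that only defines `qwen25_vl_7b_sft_config`, the helper reports the stale name, which is the red half of the red-green check.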

Verification

Red-green: FAIL on pre-fix docs, PASS after. 5 passed in nvcr.io/nvidian/nemo:26.06.rc9.
All six issues reproduced inside the container on 2026-06-23.

Reference

NVBug 6366190

@pruprakash pruprakash requested a review from yaoyu-33 June 24, 2026 21:54
@pruprakash pruprakash added labels Jun 24, 2026: docs (Documentation-only updates or documentation debt), area:misc (Cross-cutting utilities, logging, helpers, and other changes), needs-review (PR is ready for code review and waiting on a reviewer)
@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread tests/unit_tests/doc_consistency/test_readme_consistency.py
Comment thread tutorials/data/dclm/README.md
@claude

claude Bot commented Jun 24, 2026

Contributor

Light Code Review

Nice cleanup — the README fixes all look correct against current source, and the regression test is a good idea.

Two issues to address:

1. Test directory will be skipped by pytest (critical)

tests/unit_tests/docs/ matches the "docs" entry in norecursedirs (pyproject.toml), so uv run python -m pytest tests/unit_tests/ will never discover these tests. They will not run in CI. Rename the directory (e.g. doc_consistency/) to avoid the collision.

2. DCLM README: clone vs submodule path mismatch

The preprocess command now correctly uses 3rdparty/Megatron-LM/tools/preprocess_data.py, but the setup instructions above it (line 94) still say git clone https://github.com/NVIDIA/Megatron-LM.git, which creates ./Megatron-LM/. These are now inconsistent: a user who follows both steps hits a file-not-found error, because the clone lands in ./Megatron-LM/ while the command expects 3rdparty/Megatron-LM/. Update the clone instruction to use the submodule, or clarify when each path applies.
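To illustrate the first issue, the sketch below models how a `norecursedirs` entry prunes a test directory. pytest documents these entries as fnmatch-style patterns applied to directory basenames; the function `is_pruned` and the example pattern list are hypothetical and only approximate pytest's internal logic, and the real pattern list lives in this repository's pyproject.toml.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_pruned(directory: str, norecursedirs: list[str]) -> bool:
    """Simplified model: a directory is skipped if any component of its path
    has a basename matching one of the fnmatch-style patterns."""
    return any(
        fnmatch(part, pattern)
        for part in PurePosixPath(directory).parts
        for pattern in norecursedirs
    )


# Assumed pattern list for illustration; only the "docs" entry comes from the review.
patterns = ["docs", "build"]

docs_dir_skipped = is_pruned("tests/unit_tests/docs", patterns)
renamed_dir_skipped = is_pruned("tests/unit_tests/doc_consistency", patterns)
```

Under this model `docs_dir_skipped` is true and `renamed_dir_skipped` is false: the pattern `docs` matches the basename `docs` exactly but does not match `doc_consistency`, which is why renaming the directory restores collection.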


Suggested test cases

No perf tests impacted.

Align documented examples with current source and add a regression test so the
drift cannot silently return:
- scripts/training/README.md: gpt_126m_pretrain_config -> vanilla_gpt_pretrain_config;
  nonexistent qwen25_vl_pretrain_config -> qwen25_vl_7b_sft_config; document --hf_path.
- tutorials/recipes/llama/README.md: ../../conversion -> examples/conversion;
  GPTDatasetConfig field seq_length -> sequence_length (and matching comment).
- tutorials/data/dclm/README.md: preprocess_data.py -> 3rdparty/Megatron-LM/; setup step now
  inits the bundled submodule (git submodule update --init 3rdparty/Megatron-LM) instead of
  cloning a standalone copy, so the documented path is consistent for both container and repo.
- tests/unit_tests/doc_consistency/test_readme_consistency.py: stdlib-only test asserting
  documented recipes/paths/fields/flags match source (red-green verified). Placed in
  doc_consistency/ (not docs/) so pyproject norecursedirs does not exclude it from collection.

NVBug: 6366190

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pruthviraj Prakash <pruprakash@nvidia.com>
@pruprakash pruprakash force-pushed the pruprakash/fix_docs_readme_drift_6366190 branch from 5a1d5e9 to aafba7a June 24, 2026 22:05
@pruprakash

Author

/ok to test aafba7a

@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from copy-pr-bot Bot Jun 24, 2026
@pruprakash

Author

/ok to test a839369
