Skip to content

feat: Empty Block Complementing#343

Open
ravisoundar wants to merge 1 commit into
mainfrom
rs-complement
Open

feat: Empty Block Complementing#343
ravisoundar wants to merge 1 commit into
mainfrom
rs-complement

Conversation

@ravisoundar

Copy link
Copy Markdown
Collaborator

Description

Empty Block Complementing for Slurm Block Topology.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

@ravisoundar ravisoundar requested a review from dmitsh as a code owner June 4, 2026 06:00
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR implements empty-block complementing for Slurm block topology: when an accelerator domain spans more hosts than baseBlockSize it is split across multiple base blocks, and when a topology tree has structural gaps (domains owned by other partitions), placeholder empty blocks are inserted so the scheduler sees the correct aggregate shape. It also refactors DomainMap to carry rich HostInfo structs instead of raw instance-ID strings, enabling per-node domain metadata throughout the translate pipeline.

  • DomainMap refactor (domain.go, topology.go): value type changed from map[string]string to map[string]*HostInfo; AddHostInfo centralises insertion; a nil guard is added for the host-info lookup in initBlocks.
  • Complement core (block_complement.go, block_tree.go): complementBlocks filters the global domain map to partition-owned nodes, packs them into a shaped tree, and returns the flat block list only when interior gaps or a block-count increase warrants replacement; the Slurm block path (toBlockTopology) refreshes nodeInfo.blockID after complement so GetNodeTopologySpec stays consistent with the emitted file.
  • Per-partition YAML path (yaml.go): getBlockTopologyUnit now populates bInfo.name (needed for domainsForBlocks lookup) and calls complementBlocks before emitting blocks, omitting Nodes= for placeholder slots.

Confidence Score: 4/5

Safe to merge for all current production code paths; the logic gap is not reachable through today's call sites but leaves the stated safety invariant incompletely enforced.

The shouldUseComplementedBlocks guard is designed to prevent block loss when some domains are absent from the global map, but the len(out) > len(input) branch can return true in that scenario if the tree's total capacity exceeds the input list length, silently dropping missing-domain blocks. The function's own docstring promises exactly this protection yet does not deliver it in the capacity-expansion sub-case, and there is no test covering that path.

pkg/translate/block_complement.go — specifically shouldUseComplementedBlocks and the required computation in complementBlocks

Important Files Changed

Filename Overview
pkg/translate/block_complement.go New file implementing empty-block complement logic; shouldUseComplementedBlocks has an incomplete safety guard that can incorrectly use the complemented output when tree-capacity expansion makes len(out) > len(input) despite some domains being absent
pkg/translate/block_tree.go New file with tree data structures and packing algorithms; logic is sound and well-tested with deterministic host sorting and correct slot filling
pkg/translate/block.go Calls complementBlocks and refreshes nodeInfo.blockID after complement to keep GetNodeTopologySpec in sync with emitted topology; empty-block output handled correctly
pkg/translate/topology.go Stores global DomainMap on NetworkTopology and adds nil guard for hostInfo lookup in initBlocks; straightforward and safe
pkg/translate/yaml.go Per-partition path now calls complementBlocks and enriches blockInfo with domain name; empty-block emission correctly omits Nodes= field
pkg/topology/domain.go Refactors DomainMap value type from map[string]string to map[string]*HostInfo; AddHostInfo centralises insertion and preserves the empty-domain guard
pkg/translate/block_complement_test.go Comprehensive tests for complement path; TestComplementPreservesInputWhenDomainsMissing covers the equal-length fallback but misses the capacity-expansion sub-case; one inline comment has an incorrect groupSize value
pkg/translate/block_tree_test.go Thorough unit tests for all tree primitives (packing, splitting, slot collection, fanout expansion); correct and complete

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["complementBlocks(blocks, blockSizes)"] --> B{"fanoutsPerLevel OK\n& nt.domains != nil?"}
    B -- No --> Z["return original blocks"]
    B -- Yes --> C["domainsForBlocks:\nfilter to partition-owned nodes"]
    C --> D{"len(domains) == 0?"}
    D -- Yes --> Z
    D -- No --> E["groupSizeFromDomains:\ncompute 2^n padding factor"]
    E --> F["packDomainsIntoBaseBlocks:\nsplit each domain into base blocks\npad to multiple of groupSize"]
    F --> G["expandFanoutsForCapacity:\ngrow last fanout tier until\ntree fits len(packed)"]
    G --> H["buildAggregateShape +\nmergeBaseBlocksIntoTree"]
    H --> I{"len(packed) == len(blocks)?"}
    I -- Yes: no splits/padding --> J["required = len(packed)"]
    I -- No: splits or missing domains --> K["required = totalBaseBlockSlots"]
    J & K --> L["blocksFromShapedTree(tree, byName, required)"]
    L --> M["shouldUseComplementedBlocks(input, out)"]
    M -- "len(out) < len(input)" --> Z
    M -- "hasEmptyBlockSlots(out)" --> N["return complemented out"]
    M -- "len(out) > len(input)" --> N
    M -- else --> Z
Loading

Reviews (12): Last reviewed commit: "feat: Empty Block Complementing" | Re-trigger Greptile

Comment thread pkg/translate/yaml.go
Comment thread pkg/translate/block_complement.go
Comment thread pkg/translate/topology.go Outdated
@ravisoundar ravisoundar force-pushed the rs-complement branch 3 times, most recently from 91ad1ef to c122256 Compare June 6, 2026 20:06
Comment thread pkg/translate/block_complement.go
@ravisoundar ravisoundar force-pushed the rs-complement branch 4 times, most recently from f048cb8 to eb4d16a Compare June 10, 2026 22:01
Comment thread pkg/translate/block.go
@ravisoundar ravisoundar force-pushed the rs-complement branch 3 times, most recently from 77b28c3 to dba4934 Compare June 10, 2026 23:32
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant