Skip to content

fix: deregister NIXL scratch tensors after refit receive#457

Open
yafshar wants to merge 2 commits into
ai-dynamo:kavink/post-2389-phase3-4from
yafshar:yafshar/fix-nixl-scratch-registration-leak
Open

fix: deregister NIXL scratch tensors after refit receive#457
yafshar wants to merge 2 commits into
ai-dynamo:kavink/post-2389-phase3-4from
yafshar:yafshar/fix-nixl-scratch-registration-leak

Conversation

@yafshar

@yafshar yafshar commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a NIXL registration leak in the scratch receive path.

MxRefitReceiver.receive_weights_scratch() allocates fresh GPU scratch tensors per receive and registers them via NixlTransferManager.register_tensors(). Previously, this flow invoked register_memory() without retaining or releasing the returned registration descriptors, causing registrations to accumulate across repeated refits.

This change scopes scratch tensor registrations to the duration of a single receive and ensures they are deregistered immediately after the transfer completes. Persistent tensor registrations are now tracked and released on shutdown, preserving existing publisher assumptions about stable, long-lived metadata.

Changes

  • Add registration descriptor tracking to NixlTransferManager.
  • Introduce temporary_registered_tensors() for scoped, one-shot registrations.
  • Use temporary registration in receive_weights_scratch().
  • Ensure proper teardown by explicitly shutting down the wrapped MxRefitReceiver from MxV2RefitReceiver.shutdown().
  • Add unit tests (mocked) covering:
    • registration tracking
    • shutdown cleanup
    • temporary registration lifecycle

Validation

Baseline metadata grew linearly across repeated scratch receives:

9878 -> 19190 -> 28502 -> 37814 -> 47126

After this patch, metadata stayed bounded:

count: 21
first 10: [9882, 9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878]
last 10: [9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878]
first: 9882 last: 9878 delta: -4

Validation command:

python benchmarks/bench_elastic_scaling.py \
  --mode=live \
  --scenario=elastic_scale \
  --num-receivers=1 \
  --steps=30 \
  --num-tensors=32 \
  --tensor-bytes=$((8*1024*1024)) \
  --trainer-device-id=0 \
  --receiver-device-base=1 \
  --step-interval=1.0 \
  --join-interval=0.2 \
  --trainer-warmup=2.0 \
  --cycle-timeout=120 \
  --deadline=900 \
  --output=/tmp/elastic_fixed.json \
  -v 2>&1 | tee /tmp/elastic_fixed.log

@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Yaser Afshar <yaser.afshar@intel.com>
@yafshar yafshar force-pushed the yafshar/fix-nixl-scratch-registration-leak branch from 17498de to 8b952f6 Compare June 26, 2026 18:57
@yafshar yafshar changed the title Fix NIXL scratch registration leak fix: deregister NIXL scratch tensors after refit receive Jun 26, 2026
@github-actions github-actions Bot added the fix label Jun 26, 2026
@yafshar yafshar marked this pull request as ready for review June 26, 2026 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant