fix: deregister NIXL scratch tensors after refit receive by yafshar · Pull Request #457 · ai-dynamo/modelexpress

yafshar · 2026-06-26T18:51:28Z

Summary

Fixes a NIXL registration leak in the scratch receive path.

MxRefitReceiver.receive_weights_scratch() allocates fresh GPU scratch tensors per receive and registers them via NixlTransferManager.register_tensors(). Previously, this flow invoked register_memory() without retaining or releasing the returned registration descriptors, causing registrations to accumulate across repeated refits.

This change scopes scratch tensor registrations to the duration of a single receive and ensures they are deregistered immediately after the transfer completes. Persistent tensor registrations are now tracked and released on shutdown, preserving existing publisher assumptions about stable, long-lived metadata.

Changes

Add registration descriptor tracking to NixlTransferManager.
Introduce temporary_registered_tensors() for scoped, one-shot registrations.
Use temporary registration in receive_weights_scratch().
Ensure proper teardown by explicitly shutting down the wrapped MxRefitReceiver from MxV2RefitReceiver.shutdown().
Add unit tests (mocked) covering:
- registration tracking
- shutdown cleanup
- temporary registration lifecycle

Validation

Baseline metadata grew linearly across repeated scratch receives:

9878 -> 19190 -> 28502 -> 37814 -> 47126

After this patch, metadata stayed bounded:

count: 21
first 10: [9882, 9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878]
last 10: [9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878, 9878]
first: 9882 last: 9878 delta: -4

Validation command:

python benchmarks/bench_elastic_scaling.py \
  --mode=live \
  --scenario=elastic_scale \
  --num-receivers=1 \
  --steps=30 \
  --num-tensors=32 \
  --tensor-bytes=$((8*1024*1024)) \
  --trainer-device-id=0 \
  --receiver-device-base=1 \
  --step-interval=1.0 \
  --join-interval=0.2 \
  --trainer-warmup=2.0 \
  --cycle-timeout=120 \
  --deadline=900 \
  --output=/tmp/elastic_fixed.json \
  -v 2>&1 | tee /tmp/elastic_fixed.log

copy-pr-bot · 2026-06-26T18:51:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Yaser Afshar <yaser.afshar@intel.com>

…h-registration-leak

pull-request-size Bot added the size/L label Jun 26, 2026

yafshar mentioned this pull request Jun 26, 2026

feat(RL/post-2389): MX V2 Support in RL + WeightTransferEngine Support #349

Open

fix: prevent NIXL scratch registration leak

8b952f6

Signed-off-by: Yaser Afshar <yaser.afshar@intel.com>

yafshar force-pushed the yafshar/fix-nixl-scratch-registration-leak branch from 17498de to 8b952f6 Compare June 26, 2026 18:57

yafshar changed the title ~~Fix NIXL scratch registration leak~~ fix: deregister NIXL scratch tensors after refit receive Jun 26, 2026

github-actions Bot added the fix label Jun 26, 2026

yafshar marked this pull request as ready for review June 26, 2026 19:01

Merge branch 'kavink/post-2389-phase3-4' into yafshar/fix-nixl-scratc…

f769cad

…h-registration-leak

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: deregister NIXL scratch tensors after refit receive#457

fix: deregister NIXL scratch tensors after refit receive#457
yafshar wants to merge 2 commits into
ai-dynamo:kavink/post-2389-phase3-4from
yafshar:yafshar/fix-nixl-scratch-registration-leak

yafshar commented Jun 26, 2026

Uh oh!

copy-pr-bot Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yafshar commented Jun 26, 2026

Summary

Changes

Validation

Uh oh!

copy-pr-bot Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant