test(ci): add fleet scale test for row 20 (~15 workers, CRD + P2P metadata) by tanushriya910 · Pull Request #458 · ai-dynamo/modelexpress

tanushriya910 · 2026-06-27T01:05:13Z

Summary

Adds the small-fleet scale test from ci/TEST_PLAN.md row 20 — a realistic scale test that runs ~15 vLLM workers sharing weights over ModelExpress P2P, all coordinated through the Kubernetes CRD metadata backend.

Approach

Raw Deployment: uses a plain Deployment + Service. The first worker to start loads from HF and becomes the source; the rest discover it via mx-server and pull the weights via NIXL.
No pod anti-affinity: workers pack onto a100a MIG slices on shared nodes, which is what makes ~15 workers feasible on a pool with far fewer physical nodes.
NIXL over TCP (NIXL_UCX_TLS=tcp,cuda_copy): MIG partitions the GPU but not the RDMA NIC. Each a100a node has a single IB NIC, so requesting rdma/ib: 1 per pod would cap schedulable replicas at the node count; TCP sidesteps that. cuda_copy is required alongside tcp so that UCX stages GPU tensors through the CPU rather than treating them as host memory (without it, every non-source worker silently falls back to an HF disk load).
Wave scaling (1→5→10→15): the action scales the Deployment in waves and waits for the CR count at each step, so a failure localises to a specific wave rather than to the whole fleet at once.
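
The wave-scaling idea can be sketched roughly as follows. This is only an illustration of the control flow, not the code in action.yml (which is a composite action): `scale_in_waves`, `WaveTimeout`, and the injected `scale` / `count_crs` callables are hypothetical names, and the timeout and poll values are placeholders.

```python
import time
from typing import Callable, Iterable, List


class WaveTimeout(RuntimeError):
    """Raised when a wave does not reach its expected CR count in time."""


def scale_in_waves(
    waves: Iterable[int],
    scale: Callable[[int], None],       # hypothetical: sets Deployment replicas
    count_crs: Callable[[], int],       # hypothetical: counts metadata CRs
    timeout_s: float = 600.0,           # placeholder value
    poll_s: float = 5.0,                # placeholder value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[int]:
    """Scale to each wave size in turn and wait for that many CRs.

    Returns the wave sizes that completed; raises WaveTimeout naming the
    wave that stalled, so the failure points at one step of the ramp.
    """
    completed: List[int] = []
    for replicas in waves:
        scale(replicas)
        deadline = clock() + timeout_s
        while True:
            seen = count_crs()
            if seen >= replicas:
                break
            if clock() >= deadline:
                raise WaveTimeout(
                    f"wave {replicas}: only {seen} CRs after {timeout_s}s"
                )
            sleep(poll_s)
        completed.append(replicas)
    return completed
```

In a real run, `scale` would presumably wrap something like `kubectl scale deployment/<name> --replicas=N`, and `count_crs` would list the metadata CRs; both are left as injected callables here so the loop can be exercised without a cluster.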

What's included

.github/actions/run-mx-fleet-test/action.yml — composite action: deploys mx-server + fleet, waits for the source, scales in waves, runs pytest, cleans up.
ci/k8s/client/vllm/manifest-azure-fleet.yaml — the fleet Deployment + Service manifest.
ci/k8s/client/test_fleet_scale.py — asserts CR count, all CRs Ready, ≥ fleet_size - 1 P2P transfers, and end-to-end inference.
test-fleet-scale job wired into the workflow and the ci-status-check gate.
ci/TEST_PLAN.md row 20 → In CI.
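
As a rough illustration of the checks listed for test_fleet_scale.py, the sketch below expresses the CR-count, readiness, and P2P-transfer conditions as one pure function. It is not the actual test: `fleet_failures` is a hypothetical helper, the flat `{"name": ..., "phase": ...}` CR shape is an assumption (the real CR schema is not shown here), and the end-to-end inference check is omitted.

```python
from typing import Dict, List


def fleet_failures(crs: List[Dict[str, str]], p2p_transfers: int,
                   fleet_size: int) -> List[str]:
    """Return a human-readable list of failed fleet checks (empty if all pass)."""
    failures: List[str] = []

    # At least fleet_size CRs; ">=" rather than "==" tolerates surge pods.
    if len(crs) < fleet_size:
        failures.append(f"expected at least {fleet_size} CRs, found {len(crs)}")

    # Every CR must report Ready (assumed field name: "phase").
    not_ready = [cr.get("name", "?") for cr in crs if cr.get("phase") != "Ready"]
    if not_ready:
        failures.append("CRs not Ready: " + ", ".join(not_ready))

    # Every worker except the source should have pulled weights over P2P.
    if p2p_transfers < fleet_size - 1:
        failures.append(
            f"expected at least {fleet_size - 1} P2P transfers, saw {p2p_transfers}"
        )
    return failures


if __name__ == "__main__":
    crs = [{"name": f"worker-{i}", "phase": "Ready"} for i in range(15)]
    print(fleet_failures(crs, p2p_transfers=14, fleet_size=15))
```

Collecting failures into a list instead of asserting one by one means a single run reports every broken condition, which is convenient when a fleet test is slow to rerun.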

Testing

Passing in CI (run).

…adata) Adds the small-fleet scale test from TEST_PLAN row 20. Uses a raw Kubernetes Deployment (no Dynamo), no pod anti-affinity so workers pack onto a100a MIG slices, and NIXL over TCP+CUDA (NIXL_UCX_TLS=tcp,cuda_copy) because MIG slices GPU but not RDMA — each node has one IB NIC, so requesting rdma/ib: 1 per pod would cap schedulable replicas to the node count. Plain "tcp" without cuda_copy causes UCX to treat GPU tensors as host memory and fall back to HF disk load. Scales in waves (1->5->10->15) to localise failures. Asserts at least fleet_size CRs (>= to tolerate Deployment surge pods), all CRs Ready, >= 14 P2P transfers, and one inference request served.
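
The scheduling argument about the IB NIC can be made concrete with a little arithmetic. The sketch below is illustrative only: `max_schedulable` is a hypothetical helper, and the node and MIG-slice counts in the example are made-up numbers, not the actual a100a pool size or MIG profile.

```python
def max_schedulable(nodes: int, mig_slices_per_node: int,
                    ib_nics_per_node: int = 1, request_ib: bool = False) -> int:
    """Upper bound on pods that fit, one MIG slice per pod.

    If each pod also requests one rdma/ib resource, the scarcer of the two
    per-node resources (MIG slices vs. IB NICs) limits pods per node.
    """
    per_node = mig_slices_per_node
    if request_ib:
        per_node = min(per_node, ib_nics_per_node)
    return nodes * per_node


if __name__ == "__main__":
    # Hypothetical pool: 3 nodes, 7 MIG slices each, 1 IB NIC each.
    print(max_schedulable(3, 7, request_ib=True))   # capped at the node count
    print(max_schedulable(3, 7, request_ib=False))  # limited only by MIG slices
```

With these made-up numbers, requesting rdma/ib: 1 per pod would allow 3 pods, while dropping the request allows 21, which is why a ~15-worker fleet needs the TCP path on such a pool.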

codecov · 2026-06-27T01:45:35Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

pull-request-size Bot added the size/XL label Jun 27, 2026

copy-pr-bot Bot temporarily deployed to automated-release June 27, 2026 01:05 Inactive

github-actions Bot added the test label Jun 27, 2026

tanushriya910 temporarily deployed to GITLAB June 27, 2026 01:05 — with GitHub Actions Inactive

tanushriya910 force-pushed the tanushriyas/fleet-scale-ci-test branch from a6787be to 23bad18 Compare June 27, 2026 01:38

tanushriya910 deployed to GITLAB June 27, 2026 01:38 — with GitHub Actions Active

copy-pr-bot Bot deployed to automated-release June 27, 2026 01:38 Active

copy-pr-bot Bot temporarily deployed to automated-release June 27, 2026 01:38 Inactive

tanushriya910 wants to merge 1 commit into main from tanushriyas/fleet-scale-ci-test
