Enable DINOv3 PP microbatching + gradient accumulation with loss scaling and validation by roulbac · Pull Request #5 · roulbac/titanpg

roulbac · 2026-03-05T13:34:30Z

Motivation

Pipeline-parallel (PP) v1 was unusable for pp > 1: its microbatch equality guard could not be satisfied, and gradient accumulation (GA) was unsupported, which together blocked the multi-stage microbatching that PyTorch PP requires.
Make PP usable on multi-GPU topologies (PP + FSDP/HSDP/TP) by validating realistic microbatch contracts and normalizing gradients across PP microbatches and trainer-level gradient accumulation.
Accept the small, documented numerical drift in Sinkhorn teacher assignment that appears when microbatching is enabled, rather than blocking PP usage.

Description

Replace strict equality check in pipeline_dinov3 with TorchTitan-compatible validation that enforces pipeline_parallel_microbatch_size > 0, training.local_batch_size % pipeline_parallel_microbatch_size == 0, and n_microbatches = local_batch_size // pipeline_parallel_microbatch_size >= pipeline_parallel_degree (file: dinov3/infra/pipeline.py).
Add explicit loss_scale: float = 1.0 plumbing into the model/training path so the backward pass is scaled correctly under PP microbatching and trainer GA: SSLMetaArch.forward_backward(..., loss_scale) (file: dinov3/train/ssl_meta_arch.py), DinoV3PipelineStageModel.forward(..., loss_scale) (file: dinov3/infra/pipeline.py) and DinoV3SSLModel.forward_backward(..., loss_scale) (file: dinov3/model/model.py).
Enable gradient accumulation in DinoV3Trainer by removing the previous NotImplemented guard, introducing _ga_steps, iterating GA microbatches per optimizer step, and applying loss scaling formulas: non-PP loss_scale = 1/ga_steps and PP loss_scale = 1/(ga_steps * pp_microbatches_per_step) where pp_microbatches_per_step = local_batch_size / pipeline_parallel_microbatch_size (file: dinov3/trainer.py).
Update throughput/progress counters to use the effective per-step batch size under GA, so token/image accounting reflects effective_local_batch = local_batch_size * gradient_accumulation_steps and effective_global_batch = effective_local_batch * data_parallel_degree (file: dinov3/trainer.py).
Update docs to reflect the new PP contract, microbatch divisibility requirement, GA formulas, and a note about Sinkhorn microbatch drift (files: docs/02-how-training-works.md, docs/10-debugging-troubleshooting.md).
Add/update unit tests covering the new validation and GA behavior: tests/test_dinov3_pipeline_dispatch.py (microbatch divisibility and n_microbatches >= pp_degree), tests/test_dinov3_trainer_pp.py (PP+GA loss scaling and non-PP GA behavior), and tests/test_dinov3_trainer_metrics_accounting.py (GA throughput accounting). Source files modified: dinov3/infra/pipeline.py, dinov3/train/ssl_meta_arch.py, dinov3/model/model.py, dinov3/trainer.py.
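The contract and formulas listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code in this PR: the function names (validate_pp_microbatching, compute_loss_scale, effective_batch_sizes) and the error messages are hypothetical, and only the arithmetic is taken from the description.

```python
from typing import Optional, Tuple


def validate_pp_microbatching(
    local_batch_size: int, microbatch_size: int, pp_degree: int
) -> int:
    """Hypothetical helper: check the PP microbatch contract, return n_microbatches."""
    if microbatch_size <= 0:
        raise ValueError("pipeline_parallel_microbatch_size must be > 0")
    if local_batch_size % microbatch_size != 0:
        raise ValueError(
            "local_batch_size must be divisible by pipeline_parallel_microbatch_size"
        )
    n_microbatches = local_batch_size // microbatch_size
    if n_microbatches < pp_degree:
        raise ValueError("n_microbatches must be >= pipeline_parallel_degree")
    return n_microbatches


def compute_loss_scale(
    ga_steps: int, pp_microbatches_per_step: Optional[int] = None
) -> float:
    """Hypothetical helper: loss scale per the formulas in the description.

    Non-PP: 1 / ga_steps. PP: 1 / (ga_steps * pp_microbatches_per_step).
    """
    if pp_microbatches_per_step is None:
        return 1.0 / ga_steps
    return 1.0 / (ga_steps * pp_microbatches_per_step)


def effective_batch_sizes(
    local_batch_size: int, ga_steps: int, dp_degree: int
) -> Tuple[int, int]:
    """Hypothetical helper: (effective_local_batch, effective_global_batch)."""
    effective_local = local_batch_size * ga_steps
    return effective_local, effective_local * dp_degree


# Example values chosen for illustration only.
n_mb = validate_pp_microbatching(local_batch_size=16, microbatch_size=4, pp_degree=4)
print(n_mb)                                                   # 4
print(compute_loss_scale(ga_steps=2))                         # 0.5
print(compute_loss_scale(ga_steps=2, pp_microbatches_per_step=n_mb))  # 0.125
print(effective_batch_sizes(16, ga_steps=2, dp_degree=2))     # (32, 64)
```

With these example values, each of the 2 * 4 = 8 backward passes in one optimizer step contributes 1/8 of the loss, so the accumulated gradient matches a single mean over the effective local batch of 32 samples.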

Testing

Ran Python bytecode compilation with python -m compileall on the modified modules and tests; compilation succeeded for all updated files.
Ran the test subset with pytest in the current environment. Collection failed because runtime packages (torch and torchtitan) were missing, so the unit tests could not be executed here. The new and updated tests cover PP validation, PP+GA trainer behavior, non-PP GA behavior, and throughput accounting, and are expected to pass in an environment with the project dependencies installed.
Manual inspection and a light smoke check: the new validation and loss_scale plumbing follow the documented formulas and are exercised by the added unit tests, which remain to be executed in CI or a full dev environment.

Codex Task

Enable DINOv3 PP microbatching with GA-aware loss scaling

b0f3441

roulbac added the codex label Mar 5, 2026 — with ChatGPT Codex Connector

roulbac added 2 commits March 5, 2026 13:37

Add GitHub Actions workflow for Python test execution

2abddf1

Fix pytest failures in local and CI runs

243ccc1

roulbac merged commit 6f53650 into dinov3 Mar 5, 2026
1 check passed

roulbac deleted the codex/enable-dinov3-pp-on-8xgpu-with-ga branch March 6, 2026 08:12

roulbac merged 3 commits into dinov3 from codex/enable-dinov3-pp-on-8xgpu-with-ga
