diff --git a/skills/aimx-hydra-lightning-builder/SKILL.md b/skills/aimx-hydra-lightning-builder/SKILL.md index cf0e3b0..2062c12 100644 --- a/skills/aimx-hydra-lightning-builder/SKILL.md +++ b/skills/aimx-hydra-lightning-builder/SKILL.md @@ -54,13 +54,23 @@ Never edit, format, sync dependencies, generate files, or run mutation/codegen c Read `references/architecture.md` before scaffold or migration work. -- `configs/.yaml` composes `data`, `datamodule`, `model`, `plmodule`, `trainer`, `callbacks`, `logger`, `paths`, `accelerate`, and optional `experiment`. +- `configs/.yaml` composes `datamodule`, `model`, `plmodule`, `trainer`, `callbacks`, `logger`, `paths`, `accelerate`, `opt`, and `experiment`. +- Keep `configs/.yaml` as the baseline and select experiment deltas with `experiment=`, where `configs/experiment/.yaml` uses Hydra `override` defaults and parameter overrides. - `src/train.py` seeds, instantiates configured objects, logs hyperparameters, and calls `trainer.fit/validate/test`. - `BaseLitModule` owns `cfg`, `cfg.model` instantiation, optimizer/scheduler, compile/SDPA options, and shared trace helpers. - Task modules own batch parsing, loss, metrics, and prediction/evaluation outputs. -- DataModules own splits, dataloaders, sampler/collate policy, and data preparation boundaries. +- DataModules own splits, dataloaders, sampler/collate policy, and data preparation boundaries. Prefer dataset samples and batches as pytrees so task modules can evolve without positional tuple churn. - Aim trace uses Lightning loggers for scalars and explicit `experiment.track(...)` for images/distributions. +## Design Principles + +- Keep high cohesion inside modules and low coupling across modules. +- Let config define how an experiment runs; let code define what the domain operation means. +- Keep inheritance trees shallow and explicit. Prefer composition through Hydra-configured modules when behavior varies. +- Keep baseline defaults separate from experiment deltas. 
Experiments should override choices and parameters, not copy whole config trees. +- Keep optimizer and scheduler policy in `opt`; experiments override `opt` values instead of hiding optimizer settings under `model` or `trainer`. +- Use domain adapters for domain-specific behavior. Shared bases define contracts and common mechanics; child adapters implement radar, satellite, vision-frame, tabular, sequence, or other domain semantics. + ## References - `references/architecture.md`: core relationships and file layout. diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/README.md b/skills/aimx-hydra-lightning-builder/assets/template-repo/README.md index d537ab6..73787eb 100644 --- a/skills/aimx-hydra-lightning-builder/assets/template-repo/README.md +++ b/skills/aimx-hydra-lightning-builder/assets/template-repo/README.md @@ -7,13 +7,18 @@ Hydra + Lightning + Aim template for Aimx AutoResearch. ```bash uv sync uv run python src/train.py trainer.fast_dev_run=true trainer.logger=false +uv run python src/train.py experiment=exp trainer.fast_dev_run=true trainer.logger=false uv run pytest ``` +Use `experiment=` to apply a file from `configs/experiment/.yaml`. +Experiment yaml files should override config groups and values such as +`model`, `datamodule`, `trainer`, `opt`, `accelerate`, and `logger`. + Enable Aim logging by leaving `trainer.logger=true` and using `logger=aim`. ```bash -uv run python src/train.py +uv run python src/train.py experiment=exp aimx query params "run.hash != ''" --repo . aimx query metrics "metric.name != ''" --repo . aimx query metrics "metric.name == 'acc'" --repo . 
--json diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/accelerate/default.yaml b/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/accelerate/default.yaml index 01b54c5..2c850ea 100644 --- a/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/accelerate/default.yaml +++ b/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/accelerate/default.yaml @@ -1,3 +1,4 @@ compile: false precision: "32-true" fp32_matmul_precision: "highest" +sdpa: ["efficient", "flash", "math"] diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/experiment/exp.yaml b/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/experiment/exp.yaml new file mode 100644 index 0000000..3e79623 --- /dev/null +++ b/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/experiment/exp.yaml @@ -0,0 +1,39 @@ +# @package _global_ + +# Run with: +# uv run python src/train.py experiment=exp + +defaults: + - override /datamodule: dummy + - override /model: mlp + - override /plmodule: classifier + - override /callbacks: default + - override /trainer: default + - override /opt: default + - override /accelerate: default + - override /logger: aim + +task_name: train_exp +tags: ["exp", "{{ preset }}"] + +seed: 42 + +autoresearch: + experiment_name: "{{ project_name }}-exp" + +trainer: + max_epochs: 2 + gradient_clip_val: 0.5 + +datamodule: + batch_size: 32 + +model: + hidden_dim: 32 + +opt: + optimizer: + lr: 0.002 + +accelerate: + compile: false diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/opt/default.yaml b/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/opt/default.yaml index f079338..5ad8d3c 100644 --- a/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/opt/default.yaml +++ b/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/opt/default.yaml @@ -2,3 +2,9 @@ optimizer: _target_: torch.optim.AdamW lr: 0.001 weight_decay: 
0.0 + +scheduler: + _target_: torch.optim.lr_scheduler.CosineAnnealingLR + _partial_: True + T_max: 10 + eta_min: 0 diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/train.yaml b/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/train.yaml index 39acb75..93a46ec 100644 --- a/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/train.yaml +++ b/skills/aimx-hydra-lightning-builder/assets/template-repo/configs/train.yaml @@ -10,6 +10,7 @@ defaults: - logger: aim - opt: default - accelerate: default + - experiment: null task_name: train tags: ["dev", "{{ preset }}"] diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/datamodules/dummy.py b/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/datamodules/dummy.py index 5681e85..1bb40f5 100644 --- a/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/datamodules/dummy.py +++ b/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/datamodules/dummy.py @@ -2,7 +2,22 @@ import torch from lightning import LightningDataModule -from torch.utils.data import DataLoader, TensorDataset, random_split +from torch.utils.data import DataLoader, Dataset, random_split + + +class PytreeClassificationDataset(Dataset): + def __init__(self, x: torch.Tensor, y: torch.Tensor) -> None: + self.x = x + self.y = y + + def __len__(self) -> int: + return int(self.x.shape[0]) + + def __getitem__(self, index: int) -> dict[str, dict[str, torch.Tensor]]: + return { + "input": {"x": self.x[index]}, + "target": {"label": self.y[index]}, + } class RandomClassificationDataModule(LightningDataModule): @@ -25,7 +40,7 @@ def setup(self, stage: str | None = None) -> None: x = torch.randn(int(self.hparams.num_samples), int(self.hparams.num_features), generator=generator) weights = torch.randn(int(self.hparams.num_features), int(self.hparams.num_classes), generator=generator) y = torch.argmax(x @ weights, dim=1) - 
dataset = TensorDataset(x, y) + dataset = PytreeClassificationDataset(x, y) train_len = max(1, int(0.8 * len(dataset))) val_len = len(dataset) - train_len self.train_dataset, self.val_dataset = random_split(dataset, [train_len, val_len], generator=generator) diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/plmodules/__init__.py b/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/plmodules/__init__.py index fbb5a18..ba7d4da 100644 --- a/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/plmodules/__init__.py +++ b/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/plmodules/__init__.py @@ -1,3 +1,93 @@ +from __future__ import annotations + +import hydra +import lightning as L +import torch +from torch.nn.attention import SDPBackend, sdpa_kernel +from omegaconf import DictConfig + + +class BaseLitModule(L.LightningModule): + def __init__(self, cfg: DictConfig) -> None: + super().__init__() + + self.save_hyperparameters(logger=False) + self.cfg = cfg + self.net = hydra.utils.instantiate(cfg.model) + self._net_compiled = False + + sdpa_map = { + "cudnn": SDPBackend.CUDNN_ATTENTION, + "math": SDPBackend.MATH, + "efficient": SDPBackend.EFFICIENT_ATTENTION, + "flash": SDPBackend.FLASH_ATTENTION, + } + + self.sdpa_backends = [sdpa_map[backend] for backend in self.cfg.accelerate.get("sdpa", ["math"])] + + def forward(self, *args, **kwargs): + return self._model_forward(*args, **kwargs) + + def _model_forward(self, *args, **kwargs): + with sdpa_kernel(self.sdpa_backends): + return self.net(*args, **kwargs) + + def setup(self, stage: str) -> None: + if self.cfg.accelerate.compile and stage == "fit" and hasattr(torch, "compile") and not self._net_compiled: + self.net = torch.compile(self.net) + self._net_compiled = True + + def get_lr_scheduler(self, optimizer): + scheduler = hydra.utils.instantiate(self.cfg.opt.scheduler)(optimizer=optimizer) + kwargs = { + key: value for 
key, value in self.cfg.opt.items() if key not in ["optimizer", "scheduler"] + } + return { + "scheduler": scheduler, + **kwargs, + } + + def get_optimizer(self): + if self.cfg.opt.optimizer._target_ == "torch.optim.AdamW": + optimizer = hydra.utils.instantiate( + self.cfg.opt.optimizer, + params=filter(lambda p: p.requires_grad, self.net.parameters()), + ) + elif self.cfg.opt.optimizer._target_ == "colossalai.nn.optimizer.HybridAdam": + optimizer = hydra.utils.instantiate( + self.cfg.opt.optimizer, + model_params=filter(lambda p: p.requires_grad, self.net.parameters()), + ) + else: + optimizer = hydra.utils.instantiate( + self.cfg.opt.optimizer, + params=filter(lambda p: p.requires_grad, self.net.parameters()), + ) + return optimizer + + def configure_optimizers(self): + optimizer = self.get_optimizer() + if not self.cfg.opt.get("scheduler"): + return optimizer + + lr_scheduler = self.get_lr_scheduler(optimizer) + return { + "optimizer": optimizer, + "lr_scheduler": lr_scheduler, + } + + def _aim_experiments(self): + for logger in self.loggers: + experiment = getattr(logger, "experiment", None) + if experiment is not None and hasattr(experiment, "track"): + yield experiment + + def _instantiate_metric(self, name: str, defaults: dict[str, dict[str, object]]): + metrics_cfg = self.cfg.get("metrics", {}) + metric_cfg = metrics_cfg[name] if name in metrics_cfg else defaults[name] + return hydra.utils.instantiate(metric_cfg) + + from {{ package_name }}.plmodules.classifier import ClassificationModule -__all__ = ["ClassificationModule"] +__all__ = ["BaseLitModule", "ClassificationModule"] diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/plmodules/classifier.py b/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/plmodules/classifier.py index 88fbef8..6b11c0c 100644 --- a/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/plmodules/classifier.py +++ 
b/skills/aimx-hydra-lightning-builder/assets/template-repo/src/__package__/plmodules/classifier.py @@ -1,38 +1,39 @@ from __future__ import annotations -import hydra -import lightning as L import torch import torch.nn.functional as F from omegaconf import DictConfig +from {{ package_name }}.plmodules import BaseLitModule -class ClassificationModule(L.LightningModule): + +class ClassificationModule(BaseLitModule): def __init__(self, cfg: DictConfig) -> None: - super().__init__() - self.save_hyperparameters(logger=False) - self.cfg = cfg - self.net = hydra.utils.instantiate(cfg.model) + super().__init__(cfg) - def forward(self, x: torch.Tensor) -> torch.Tensor: - return self.net(x) + def _parse_batch(self, batch: dict[str, dict[str, torch.Tensor]]) -> tuple[torch.Tensor, torch.Tensor]: + return batch["input"]["x"], batch["target"]["label"] - def _shared_step(self, batch, mode: str) -> torch.Tensor: - x, y = batch + def _shared_step(self, batch, mode: str) -> dict[str, torch.Tensor]: + x, y = self._parse_batch(batch) logits = self(x) - loss = F.cross_entropy(logits, y) preds = torch.argmax(logits, dim=1) - acc = (preds == y).float().mean() - on_step = mode == "train" - self.log(f"{mode}/loss", loss, on_step=on_step, on_epoch=True, prog_bar=True) - self.log(f"{mode}/acc", acc, on_step=on_step, on_epoch=True, prog_bar=True) - return loss + res = { + "y_hat": preds, + "y": y, + } + if mode in ["train", "val"]: + loss = F.cross_entropy(logits, y) + acc = (preds == y).float().mean() + on_step = mode == "train" + self.log(f"{mode}/loss", loss, on_step=on_step, on_epoch=True, prog_bar=True) + self.log(f"{mode}/acc", acc, on_step=on_step, on_epoch=True, prog_bar=True) + res["loss"] = loss + return res def training_step(self, batch, batch_idx: int) -> torch.Tensor: - return self._shared_step(batch, "train") - - def validation_step(self, batch, batch_idx: int) -> None: - self._shared_step(batch, "val") + res = self._shared_step(batch, "train") + return res["loss"] - def 
configure_optimizers(self): - return hydra.utils.instantiate(self.cfg.opt.optimizer, params=self.parameters()) + def validation_step(self, batch, batch_idx: int) -> dict[str, torch.Tensor]: + return self._shared_step(batch, "val") diff --git a/skills/aimx-hydra-lightning-builder/assets/template-repo/tests/test_fast_dev_run.py b/skills/aimx-hydra-lightning-builder/assets/template-repo/tests/test_fast_dev_run.py index d127214..9577172 100644 --- a/skills/aimx-hydra-lightning-builder/assets/template-repo/tests/test_fast_dev_run.py +++ b/skills/aimx-hydra-lightning-builder/assets/template-repo/tests/test_fast_dev_run.py @@ -3,12 +3,16 @@ import subprocess import sys +import pytest -def test_fast_dev_run() -> None: + +@pytest.mark.parametrize("overrides", [(), ("experiment=exp",)]) +def test_fast_dev_run(overrides: tuple[str, ...]) -> None: result = subprocess.run( [ sys.executable, "src/train.py", + *overrides, "trainer.fast_dev_run=true", "trainer.logger=false", "trainer.enable_progress_bar=false", @@ -18,3 +22,10 @@ def test_fast_dev_run() -> None: text=True, ) assert result.returncode == 0, result.stderr + + +def test_plmodule_exports() -> None: + from {{ package_name }}.plmodules import BaseLitModule, ClassificationModule + + assert BaseLitModule.__name__ == "BaseLitModule" + assert ClassificationModule.__name__ == "ClassificationModule" diff --git a/skills/aimx-hydra-lightning-builder/references/architecture.md b/skills/aimx-hydra-lightning-builder/references/architecture.md index 230e984..1d2b831 100644 --- a/skills/aimx-hydra-lightning-builder/references/architecture.md +++ b/skills/aimx-hydra-lightning-builder/references/architecture.md @@ -16,9 +16,31 @@ Use a primary config such as `configs/train.yaml` with defaults for: - `logger` - `paths` - `accelerate` -- optional `experiment` +- `opt` +- `experiment` -Experiment configs should override choices and hyperparameters, not duplicate the whole tree. 
+ +Baseline defaults live in the regular config groups, such as `datamodule/default.yaml`, `model/default.yaml`, `trainer/default.yaml`, `logger/default.yaml`, and `opt/default.yaml`. Keep the primary config as the reproducible baseline and select experiment deltas explicitly: + +```bash +uv run python src/train.py experiment=exp +``` + +Experiment files live under `configs/experiment/.yaml`, use `# @package _global_`, and override config-group choices with Hydra defaults such as `override /model: mlp` or `override /opt: default`. They should override choices and hyperparameters, not duplicate the whole tree. + +Keep optimizer and scheduler policy in `opt`. Override learning rates, weight decay, scheduler settings, and optimizer choices through `opt` in the experiment yaml instead of placing optimizer state under `model`, `plmodule`, or `trainer`. + +Config defines how an experiment runs: + +- which datamodule, model, plmodule, callbacks, logger, and trainer are instantiated; +- paths, batch sizes, worker counts, optimizer settings, precision, and accelerator choices; +- experiment names, tags, objective metadata, and evidence switches. + +Code defines what the operation means: + +- how a domain batch is parsed; +- how targets, losses, metrics, and predictions are computed; +- what qualitative artifacts or distribution traces mean for the domain. ## Runtime Layer @@ -41,12 +63,61 @@ Experiment configs should override choices and hyperparameters, not duplicate th - apply compile, precision, or SDPA settings from `cfg.accelerate`; - provide helper methods for Aim experiments when explicit traces are needed. -Task subclasses should own only domain logic: batch parsing, forward call, loss, metrics, prediction outputs, and optional qualitative trace artifacts. +Task subclasses should inherit from `BaseLitModule` and own only domain logic: batch parsing, forward call, loss, metrics, prediction outputs, and optional qualitative trace artifacts. 
Do not duplicate optimizer setup, model instantiation, logger access, compile handling, or trainer construction in each task module. + +Keep inheritance trees shallow and explicit: + +- use one shared base for contracts and mechanics; +- use one child adapter for the domain task; +- avoid framework-like hierarchies where behavior is spread across several parent classes. + +Prefer composition when behavior varies. Swap models, losses, metrics, callbacks, data sources, and loggers through Hydra config instead of adding inheritance layers. + +## Domain Adapter Pattern + +Use domain adapters when multiple datasets or modalities should share the same AutoResearch contract but differ in domain semantics. Shared bases define stable contracts and common mechanics; child adapters translate domain data into that contract. + +Examples: + +- radar adapters parse radar tensors, lead times, geospatial masks, and forecast targets; +- satellite adapters parse channels, tiles, projections, and cloud or retrieval targets; +- vision-frame adapters parse images, labels, boxes, masks, or frame windows; +- tabular adapters parse feature tables, categorical encodings, sample weights, and targets; +- sequence adapters parse token, sensor, event, or time-series windows. + +A domain adapter should own: + +- batch parsing and validation; +- target construction and masking; +- domain metrics and loss inputs; +- prediction formatting; +- optional qualitative artifacts and Aim traces. + +A domain adapter should not own: + +- trainer construction; +- logger construction; +- config composition; +- filesystem layout; +- sweep or experiment orchestration. ## Data Layer `LightningDataModule` classes own data preparation and dataloaders. Keep user data paths in config. Use dummy/random data in templates so fast validation does not depend on private datasets. +Keep data modules cohesive: they prepare datasets, splits, dataloaders, sampling, and collation. 
Keep model math and task losses out of data modules. + +Prefer dataset samples as pytrees, such as nested dictionaries or dataclasses containing tensors, arrays, masks, metadata, or target leaves. Pytrees keep model and task design flexible because new leaves can be added without changing every positional tuple unpack. Let the DataLoader collate the pytree when possible, and let the task adapter parse named leaves into model inputs, targets, masks, and metadata. + +Use stable leaf names that express domain meaning: + +- `inputs` or `input` for tensors passed to the model; +- `targets` or `target` for supervised labels or forecast targets; +- `metadata` for ids, timestamps, coordinates, horizon, or source provenance; +- `mask` or domain-specific masks for valid regions, sample weights, or loss masks. + +Avoid positional tuple batches in templates and migrations unless an upstream dataset API forces them. If the upstream API returns tuples, adapt them into a pytree at the dataset or collate boundary. + ## Evidence Layer Use Lightning `self.log` for scalar metrics. Use Aim `experiment.track(...)` for images and distributions. Keep evidence names stable and context-rich so `aimx` can query them later. 
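Note on the Data Layer guidance in `references/architecture.md` above: the advice to adapt upstream tuples into a pytree "at the dataset or collate boundary" could look roughly like the sketch below. It is an illustration only, not part of this patch. `TupleToPytreeDataset` and `parse_batch` are hypothetical names, and plain Python lists stand in for tensors so the sketch runs without PyTorch. The leaf names (`input`/`x`, `target`/`label`) follow `PytreeClassificationDataset` in `datamodules/dummy.py`.

```python
from __future__ import annotations

from typing import Any, Sequence


class TupleToPytreeDataset:
    """Hypothetical wrapper: exposes positional (x, y) samples as named leaves."""

    def __init__(self, upstream: Sequence[tuple[Any, Any]]) -> None:
        # `upstream` is any indexable source that yields (x, y) tuples.
        self.upstream = upstream

    def __len__(self) -> int:
        return len(self.upstream)

    def __getitem__(self, index: int) -> dict[str, dict[str, Any]]:
        # The tuple is unpacked once here, so downstream code never
        # depends on positional order.
        x, y = self.upstream[index]
        return {"input": {"x": x}, "target": {"label": y}}


def parse_batch(sample: dict[str, dict[str, Any]]) -> tuple[Any, Any]:
    """Mirror of the task module's `_parse_batch`: read leaves by name."""
    return sample["input"]["x"], sample["target"]["label"]


# Stand-in for an upstream dataset API that returns positional tuples.
upstream = [([0.1, 0.2], 1), ([0.3, 0.4], 0)]
dataset = TupleToPytreeDataset(upstream)
features, label = parse_batch(dataset[1])
```

Because the wrapper implements `__len__` and `__getitem__`, it should be usable as a map-style dataset. With real tensors as leaves, PyTorch's default collate recurses into mappings, so the batch keeps the same nested names.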
diff --git a/skills/aimx-hydra-lightning-builder/references/migration-audit.md b/skills/aimx-hydra-lightning-builder/references/migration-audit.md
index cc1cf5c..b2605a2 100644
--- a/skills/aimx-hydra-lightning-builder/references/migration-audit.md
+++ b/skills/aimx-hydra-lightning-builder/references/migration-audit.md
@@ -10,7 +10,15 @@ Inspect:
 - training entrypoints;
 - Hydra config root and defaults composition;
 - model, datamodule, plmodule, trainer, callback, logger config groups;
+- `opt` config group for optimizer and scheduler policy;
+- `experiment` config group with explicit Hydra override files;
 - LightningModule and LightningDataModule classes;
+- dataset item and batch shape, preferring named pytree leaves over positional tuples;
+- domain adapter boundaries for radar, satellite, vision-frame, tabular, sequence, or other domain-specific logic;
+- shared bases that hold only contracts and common mechanics;
+- shallow inheritance trees with explicit child adapters;
+- high-cohesion modules with low cross-module coupling;
+- baseline defaults separated from experiment deltas;
 - metric logging through `self.log`;
 - AimLogger config and direct `experiment.track(...)` traces;
 - hyperparameter logging;
@@ -22,8 +30,12 @@ Inspect:
 1. Establish the AutoResearch contract.
 2. Add or normalize Hydra config groups.
 3. Move runtime orchestration into `src/train.py`.
-4. Adapt model/data/task code into Lightning boundaries.
-5. Add Aim/Aimx evidence conventions.
-6. Add fast validation tests.
+4. Separate baseline defaults from experiment deltas.
+5. Move optimizer and scheduler choices into `opt`.
+6. Adapt datasets and collate outputs into named pytrees.
+7. Adapt model/data/task code into Lightning boundaries.
+8. Introduce domain adapters only where domain semantics differ.
+9. Add Aim/Aimx evidence conventions.
+10. Add fast validation tests.
 
 Keep migration patches small and reversible.
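Note on migration steps 4 and 5 above: the idea that an experiment carries only a delta over the baseline, with optimizer policy kept under `opt`, can be illustrated with a plain recursive dictionary merge. This is a conceptual sketch, not how Hydra composes configs internally; Hydra and OmegaConf also handle defaults lists, interpolation, and typing. `merge_delta` is a hypothetical helper. The values are the ones shown in `configs/opt/default.yaml` and `configs/experiment/exp.yaml` in this patch.

```python
from __future__ import annotations

import copy
from typing import Any


def merge_delta(baseline: dict[str, Any], delta: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `baseline` with `delta` applied key by key (hypothetical helper)."""
    merged = copy.deepcopy(baseline)
    for key, value in delta.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys not named in the delta are preserved.
            merged[key] = merge_delta(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Baseline `opt` group, as in configs/opt/default.yaml.
baseline = {
    "opt": {
        "optimizer": {"_target_": "torch.optim.AdamW", "lr": 0.001, "weight_decay": 0.0},
        "scheduler": {
            "_target_": "torch.optim.lr_scheduler.CosineAnnealingLR",
            "_partial_": True,
            "T_max": 10,
            "eta_min": 0,
        },
    }
}

# Experiment delta: only the learning rate changes, as in configs/experiment/exp.yaml.
experiment_delta = {"opt": {"optimizer": {"lr": 0.002}}}

resolved = merge_delta(baseline, experiment_delta)
```

The experiment file stays small because everything it does not mention, such as `weight_decay` and the scheduler block, is inherited from the baseline. The baseline dictionary itself is left unmodified.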