Merged
26 commits
eb63835
Add infmax v3 branch for faster response
RogerLiu312 Mar 20, 2026
ab6a046
Bump version
RogerLiu312 Mar 20, 2026
7c668a7
Add composed sflow config in bulk csv results
RogerLiu312 Mar 20, 2026
e3bd97e
Fix runtime venv conflict
RogerLiu312 Mar 20, 2026
e8a8b3f
Add CLI hint when the user mistakenly passes a CSV file to sflow -f
RogerLiu312 Mar 23, 2026
7e14112
Add chained error info for better debugging experience
RogerLiu312 Mar 23, 2026
4683096
Improve replicated task http probe
RogerLiu312 Mar 23, 2026
32ef20a
Fix probe timeout and improve http probe redundancy
RogerLiu312 Mar 24, 2026
efe650b
Refactor --bulk-input to make all cli entry behaviour consistent
RogerLiu312 Mar 24, 2026
bc42c87
Implement sbatch extra args resolution for CLI, enhancing support for…
RogerLiu312 Mar 31, 2026
cc474f3
Fix cases where some custom clusters do not support enroot containers
RogerLiu312 Apr 2, 2026
11e34b8
Add support for negative indices and open-ended slices in row selecti…
RogerLiu312 Apr 2, 2026
1695bdc
Add sflow_batch_dir column to results.csv for bulk input and submit o…
RogerLiu312 Apr 3, 2026
390442b
Enhance node resource management by adding support for negative indic…
RogerLiu312 Apr 15, 2026
fbe0093
Add variable expression support for domain info
RogerLiu312 Apr 16, 2026
c495d19
Enhance variable domain resolution for replica sweeps and update rela…
RogerLiu312 Apr 16, 2026
3ea9954
Add knowledge index
RogerLiu312 Apr 16, 2026
70a69bb
Remove agent files
RogerLiu312 Apr 17, 2026
19f673f
Update CLI behavior to ensure that `--set` values take precedence ove…
RogerLiu312 Apr 17, 2026
477e85a
Add support for resolving effective sflow version in batch scripts. E…
RogerLiu312 Apr 22, 2026
8f1e4de
Fix jinja2 expression wrapped variables are not correctly resolved wh…
RogerLiu312 Apr 24, 2026
ae63b3a
Enhance node resource configuration to support expression strings for…
RogerLiu312 Apr 24, 2026
6b280d7
Implement support for multiple readiness probes in task configuration…
RogerLiu312 Apr 27, 2026
a41e7da
Add release notes for sflow v0.2.1, documenting CLI and batch workflo…
RogerLiu312 Apr 28, 2026
760322e
Add contribution guides
RogerLiu312 Apr 28, 2026
c44f8ea
Update version to 0.2.1 in pyproject.toml to reflect the latest release.
RogerLiu312 Apr 28, 2026
7 changes: 6 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -233,4 +233,9 @@ workflow_outputs/
aiperf_artifacts/
.cursor/
tests/e2e_tests/sflow.sh
tests/e2e_tests/*_config.yaml
tests/e2e_tests/*_config.yaml
tests/e2e_tests/.sflow_venv.lock
.gitnexus
.claude/
AGENTS.md
CLAUDE.md
138 changes: 137 additions & 1 deletion CONTRIBUTING.md
@@ -1 +1,137 @@
The project currently is not accepting external contributions.
# Contributing to sflow

Thank you for contributing to sflow. sflow is a declarative workflow descriptor that separates what to deploy from where to deploy it.

Contributors describe a workflow once in portable YAML -- tasks, dependencies, resources, launch methods, probes, artifacts, replicas, and sweeps -- and sflow executes the DAG through swappable backends. The current focus is Slurm, where sflow fills the workflow orchestration gap around `salloc`, `srun`, resource placement, and batch submission. Docker and Kubernetes backends are planned.

The repository also carries production-ready examples for NVIDIA Dynamo and LLM inference benchmarking, including modular SGLang, vLLM, and TensorRT-LLM workflows.

This guide explains how to keep changes reviewable, tested, and compatible with downstream co-development workflows.

## Contribution Scope

This project does not accept NVIDIA-external code contributions at this time. If you are an external user and have a bug report, feature request, documentation gap, or other issue that needs attention, please file an issue so maintainers can triage it.

NVIDIA-internal co-development is allowed. Internal contributors should follow the applicable internal engineering, review, and release process documentation in addition to the project-specific rules below.

## Issue Tracking

All enhancement requests, bug reports, documentation gaps, and behavior-change proposals should start with an issue or an internal tracking item.

- External users should file a GitHub issue with enough detail for maintainers to reproduce or understand the request.
- NVIDIA-internal contributors should link the relevant internal task or release tracking item when applicable.
- Feature work that changes user-facing behavior, sample workflows, CLI semantics, or release behavior should be agreed with maintainers before code review begins.
- If a change might break existing behavior, mark it clearly as a breaking change in the issue and pull request.

## Repository Layout

- `src/sflow/`: Python package source.
- `src/sflow/cli/`: CLI commands such as `run`, `batch`, `compose`, `sample`, and `visualize`.
- `src/sflow/app/`: application assembly and high-level workflow execution.
- `src/sflow/config/`: YAML loading, schema validation, and expression resolution.
- `src/sflow/core/`: core DAG, task, probe, backend, operator, artifact, and orchestration logic.
- `src/sflow/plugins/`: built-in backends, operators, probes, and artifact handlers.
- `examples/`: user-facing workflow examples used for local and Slurm regression coverage.
- `src/sflow/samples/`: packaged copies of sample workflows exposed by `sflow sample`.
- `tests/`: unit tests.
- `scripts/full_sample_tests.sh`: end-to-end and preflight regression coverage for shipped examples.
- `docs/`: user documentation and release notes.

## Development Setup

```bash
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
pytest
```

Always activate the project virtual environment before running commands.

## Coding Guidelines

Keep changes narrowly scoped to the behavior you intend to modify. Prefer existing patterns in the surrounding code over new abstractions.

Please also:

- Avoid committing commented-out code.
- Avoid unrelated formatting churn.
- Keep pull requests focused on one concern. If several unrelated changes are needed, split them into separate pull requests and describe any dependency between them.
- Use clear commit and pull request titles. NVIDIA-internal changes should include the relevant internal tracking ID in the title when applicable.
- Target the branch requested by the relevant internal task or release process. Do not assume every fix belongs on `main`.

## Change Policy

For any feature change:

- Do not modify existing unit tests just to make the new behavior pass.
- Do not modify existing end-to-end cases in `scripts/full_sample_tests.sh` just to make the new behavior pass.
- Add new test coverage for the new behavior.
- Add or update a matching example under `examples/` so co-developed features are covered by future regression runs.
- If the example is meant to be available through `sflow sample`, keep the packaged copy under `src/sflow/samples/` in sync.
- Update user docs and release notes when the behavior is user-facing.

The only exception is an intentional breaking change. In that case, the pull request must clearly explain:

- What old behavior is being broken.
- Why compatibility is not preserved.
- Which existing tests or e2e cases were changed and why.
- How users should migrate.

## Tests and Examples

Every feature change should include focused tests near the changed behavior:

- CLI behavior: add or extend tests under `tests/unit/test_cli_*.py`.
- Config schema or resolver behavior: add or extend tests under `tests/unit/test_config_*.py`.
- Task graph, resource, replica, or probe behavior: add or extend tests under `tests/unit/test_app_assembly_*.py`, `tests/unit/test_core_*.py`, or probe-specific tests.
- Artifact behavior: add or extend tests under `tests/unit/test_artifacts_*.py`.

Add examples that exercise the feature in the same style users will copy:

- Local-only examples should be runnable without Slurm.
- Slurm examples should use variable defaults that can be overridden by `--set` or CSV columns.
- Modular examples should document required `missable_tasks` values when some tasks may be absent.
- Keep `examples/` and `src/sflow/samples/` aligned for packaged samples.

Before submitting a feature change, run the focused tests for your area and the relevant sample regression path:

```bash
pytest tests/unit/<targeted_test_file>.py
scripts/full_sample_tests.sh -P
```

For changes that affect sample workflows, also run the relevant mode:

```bash
scripts/full_sample_tests.sh -s -P # self-contained examples
scripts/full_sample_tests.sh -m -P # modular examples
```

Use `-S` only when you intend to submit real Slurm jobs.

## Documentation

Update documentation in the same change when behavior changes. Common locations:

- `docs/user/cli.md` for CLI flags and modes.
- `docs/user/configuration.md` and `docs/user/quick-reference.md` for YAML schema changes.
- `docs/user/resources.md`, `docs/user/probes.md`, `docs/user/variables.md`, or `docs/user/replicas.md` for feature-specific behavior.
- `docs/release_notes/` for release-facing summaries.

Do not add large generated or presentation artifacts to release notes unless they are intentionally part of the release.

## Pull Request Checklist

Before opening an NVIDIA-internal pull request:

- The issue or internal tracking item is linked.
- The change is scoped to one feature or fix.
- Existing behavior is preserved unless the PR explicitly declares a breaking change.
- New behavior has focused unit coverage.
- User-facing behavior has an example under `examples/`.
- Packaged samples under `src/sflow/samples/` are updated when applicable.
- Relevant docs and release notes are updated.
- Focused tests pass.
- Relevant `scripts/full_sample_tests.sh` preflight path passes or any skipped validation is explained.
- Performance, compatibility, or release risks are called out in the pull request description.
56 changes: 56 additions & 0 deletions docs/release_notes/RELEASE_NOTES_v0.2.1.md
@@ -0,0 +1,56 @@
# sflow v0.2.1 Release Notes

**Release date:** April 2026
**Previous release:** v0.2.0 (March 2026)

---

## Highlights

sflow v0.2.1 is a documentation and workflow polish release for the InfMax v3 migration path. It documents the branch behavior for CSV-driven execution, self-contained YAML batch submission, replica variable domains, node placement, and probe orchestration.

---

## User-Facing Changes

### CLI and Batch Workflows

- **`sflow run --bulk-input`** single-row CSV execution is now documented. Pass `--row` with exactly one selector to run a specific CSV row.
- **Advanced `--row` selectors** are documented for `run`, `compose`, and `batch`: repeated flags, comma lists, Python-style slices with exclusive end, open-ended slices, and negative indices such as `--row=-1`.
- **`sflow batch --bulk-submit`** is documented for submitting self-contained YAML files, folders, or glob patterns without CSV merging.
- **Auto-derived node counts** are documented. Single-job and bulk-submit batch modes can derive `--nodes` from the Slurm backend; bulk-input mode requires either `--nodes` or a CSV node-count column.
- **`--sflow-version`** is documented for pinning the git ref installed by generated sbatch scripts.
- **Expression-aware `--sbatch-extra-args`** is documented. Extra sbatch directives can resolve `${{ variables.X }}` or shorthand `${{ X }}` from config defaults, CLI `--set`, and CSV row values.
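
The `--row` selector forms listed above can be illustrated with a small sketch. This is not sflow's parser: `parse_row_selectors` is a hypothetical helper, and it assumes zero-based row indices and a `start:end` colon syntax for slices, neither of which is confirmed by these notes. It only shows how repeated flags, comma lists, exclusive-end slices, open-ended slices, and negative indices could combine.

```python
def parse_row_selectors(selectors, n_rows):
    """Expand selector strings into row indices (hypothetical helper).

    Assumes 0-based indexing and 'start:end' slice syntax; sflow's real
    parser may differ.
    """
    all_rows = list(range(n_rows))
    picked = []
    for selector in selectors:              # one entry per repeated --row flag
        for part in selector.split(","):    # comma list inside a single flag
            part = part.strip()
            if ":" in part:
                start_text, end_text = part.split(":", 1)
                start = int(start_text) if start_text else None
                end = int(end_text) if end_text else None
                # Python slicing: end is exclusive, missing bounds are open-ended.
                picked.extend(all_rows[start:end])
            else:
                # A negative index counts back from the last row.
                picked.append(all_rows[int(part)])
    return picked


rows = parse_row_selectors(["0,2", "3:5", "-1", "8:"], n_rows=10)
print(rows)  # [0, 2, 3, 4, 9, 8, 9]
```

The sketch keeps duplicates and preserves the order in which selectors were given; whether sflow deduplicates or sorts the selected rows is not stated here.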

### Variables and Replica Sweeps

- **Variable domain metadata** is documented through `${{ variables.NAME.domain }}`.
- **Replica sweep behavior** is clarified: `${{ variables.NAME }}` resolves to the per-replica value, while `${{ variables.NAME.domain }}` remains the full domain list.
- **Domain overrides via `--set`** are documented: JSON-style list values update the variable `domain`, and the variable value becomes the first list item.
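
A minimal sketch of the value/domain relationship described above, under stated assumptions: `Variable`, `apply_set`, and `replica_views` are hypothetical names, not sflow APIs, and the sketch assumes one replica per domain item and that a scalar override leaves the domain untouched.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Hypothetical stand-in for a workflow variable with an optional domain."""
    value: object
    domain: list = field(default_factory=list)


def apply_set(variable, raw):
    """Apply a --set style override given as a raw string (hypothetical)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = raw  # not valid JSON: treat it as a plain string value
    if isinstance(parsed, list) and parsed:
        # JSON-style list: it replaces the domain, and the value becomes
        # the first list item.
        return Variable(value=parsed[0], domain=list(parsed))
    # Assumption: a scalar override changes only the value.
    return Variable(value=parsed, domain=list(variable.domain))


def replica_views(variable):
    """Yield (value, domain) as each replica of a sweep would see them."""
    for item in variable.domain:
        # The value is per replica; the domain stays the full list.
        yield item, list(variable.domain)


batch_size = apply_set(Variable(value=8, domain=[8, 16]), "[1, 2, 4]")
print(batch_size.value, batch_size.domain)  # 1 [1, 2, 4]
print(list(replica_views(batch_size)))
```

In template terms, the first element of each yielded pair plays the role of `${{ variables.NAME }}` inside a replica, and the second plays the role of `${{ variables.NAME.domain }}`.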

### Resources and Placement

- **`resources.nodes.exclude`** is documented for removing nodes from the placement pool before applying `indices`, `count`, or GPU packing.
- **Negative node indices** are clarified, including the fact that negative `indices` are resolved after `exclude` filtering.
- **Default Slurm placement** is documented: when a task does not set `resources.nodes`, sflow passes the full backend allocation to `srun`.
- **GPU packing behavior** is documented, including multi-node expansion when a GPU request is an exact multiple of `gpus_per_node`.
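
The ordering of `exclude` and `indices` matters most when negative indices are involved. The sketch below is an illustration only: `select_nodes` is a hypothetical helper, and it assumes that `exclude` holds hostnames and that `count` takes the first N remaining nodes. GPU packing is left out because its exact rules are not spelled out in these notes.

```python
def select_nodes(allocation, exclude=(), indices=None, count=None):
    """Pick nodes for a task from the backend allocation (hypothetical helper).

    Assumes `exclude` holds hostnames and `count` takes the first N nodes;
    sflow's actual rules may differ.
    """
    excluded = set(exclude)
    # Step 1: remove excluded nodes from the placement pool.
    pool = [node for node in allocation if node not in excluded]
    if indices is not None:
        # Step 2: indices, including negative ones, refer to the filtered pool.
        return [pool[i] for i in indices]
    if count is not None:
        return pool[:count]
    # No constraints: the task gets the whole remaining pool.
    return pool


allocation = ["node0", "node1", "node2", "node3"]
print(select_nodes(allocation, exclude=["node3"], indices=[0, -1]))
# ['node0', 'node2']
```

Because `node3` is filtered out first, index `-1` lands on `node2`; without the exclusion it would land on `node3`.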

### Probes

- **Probe timing defaults** are documented, including `timeout: 1200` for readiness probes and `each_check_timeout: 30`.
- **HTTP probes** (`http_get` and `http_post`) are documented with examples.
- **Multiple readiness probes** are documented as AND semantics: all readiness probes must trigger before a task becomes ready.
- **Failure probes** are documented as fail-fast signals that mark tasks as failed by probe and cancel downstream work.
- **Replica HTTP probe deduplication** is documented for parallel replicas with identical HTTP probes.
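
The AND semantics and the two timing defaults above can be sketched as a polling loop. This is a hypothetical illustration, not sflow's probe engine: `wait_until_ready` and its `interval` parameter are invented for the example, and it assumes a probe stays triggered once it has fired.

```python
import time


def wait_until_ready(probes, timeout=1200.0, each_check_timeout=30.0, interval=1.0):
    """Return True once every readiness probe has triggered (AND semantics).

    `probes` maps a name to a callable that takes a per-check timeout in
    seconds and returns True when it triggers. Hypothetical sketch: it
    assumes a probe stays triggered once it has fired.
    """
    pending = dict(probes)
    deadline = time.monotonic() + timeout
    while pending:
        for name, check in list(pending.items()):
            if check(each_check_timeout):
                del pending[name]  # this probe is satisfied
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False  # overall timeout reached with probes still pending
        time.sleep(interval)
    return True


calls = {"n": 0}


def port_open(_timeout):
    return True


def model_loaded(_timeout):
    calls["n"] += 1
    return calls["n"] >= 3  # triggers on the third check


ready = wait_until_ready({"port": port_open, "model": model_loaded}, interval=0.01)
print(ready)  # True
```

The task only becomes ready after the slower probe fires, even though the first probe succeeded on the initial pass.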

---

## Documentation Updated

- `docs/user/cli.md`
- `docs/user/variables.md`
- `docs/user/resources.md`
- `docs/user/probes.md`
- `docs/user/quick-reference.md`
- `docs/user/configuration.md`
- `docs/user/architecture.md`
6 changes: 3 additions & 3 deletions docs/user/architecture.md
@@ -204,9 +204,9 @@ stateDiagram-v2

| Command | Purpose | Key Options |
|---------|---------|-------------|
| **`sflow run`** | Execute a workflow | `--dry-run`, `--tui`, `--set/-s`, `--artifact/-a`, `--missable-tasks/-M`, `--extra-args`, `--output-dir`, `--log-level` |
| **`sflow batch`** | Generate Slurm sbatch scripts | `--submit`, `--bulk-input` (CSV sweeps), `--nodes`, `--partition`, `--account`, `--time`, `--resolve` |
| **`sflow compose`** | Merge multiple YAMLs into one | `--resolve`, `--validate`, `--bulk-input`, `--missable-tasks/-M`, `-o/--output` |
| **`sflow run`** | Execute a workflow | `--dry-run`, `--tui`, `--bulk-input/--row`, `--set/-s`, `--artifact/-a`, `--missable-tasks/-M`, `--extra-args`, `--output-dir`, `--log-level` |
| **`sflow batch`** | Generate Slurm sbatch scripts | `--submit`, `--bulk-input` (CSV sweeps), `--bulk-submit` (YAML folders), `--row`, `--nodes`, `--partition`, `--account`, `--time`, `--resolve`, `--sflow-version` |
| **`sflow compose`** | Merge multiple YAMLs into one | `--resolve`, `--validate`, `--bulk-input`, `--row`, `--missable-tasks/-M`, `-o/--output` |
| **`sflow visualize`** | Render DAG as image/mermaid | `--format` (png/svg/pdf/mermaid/dot), `--show-variables`, `--set/-s`, `--artifact/-a`, `--missable-tasks/-M` |
| **`sflow sample`** | List/copy example workflows | `--list`, `--force`, `-o/--output` |
| **`sflow skill`** | Copy agent skills into project (merges into existing directory) | `--list`, `--force` (overwrite existing files), `-o/--output` |