20 changes: 20 additions & 0 deletions .github/workflows/validate-problems.yml
@@ -0,0 +1,20 @@
name: Validate Problems

on:
  pull_request:
  push:
    branches: ["main"]

jobs:
  contract:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Structural validation
        run: python scripts/validate_problem.py --runtime none --format text
27 changes: 27 additions & 0 deletions README.md
@@ -41,3 +41,30 @@ The `problem.md` file should contain a description of the problem written in Mar
- `const`: boolean

Once you add a problem, make sure to test both correct (slow/fast) and incorrect submissions. Let us know if you encounter any issues/bugs!

## Contract And Validation

For new work, use the contract-first docs in this repo:

- `docs/problem-authoring-contract.md`
- `docs/problem-validation-contract.md`
- `docs/problem-automation-roadmap.md`

A validator with machine-readable output is now available:

```bash
python scripts/validate_problem.py --runtime none
python scripts/validate_problem.py relu softmax --runtime first
python scripts/validate_problem.py gemm-relu-divide --runtime all --enforce-wrong-answer-rejection
```

Recommended workflow:

1. pass structural validation in CI
2. validate locally on CUDA or Together H100
3. validate through the Modal-backed product runtime before merge when the runtime path matters

Templates for new problems live in:

- `templates/problem-template.def.py`
- `templates/problem-template.md`
149 changes: 149 additions & 0 deletions docs/problem-authoring-contract.md
@@ -0,0 +1,149 @@
# Problem Authoring Contract

This document defines the backward-compatible authoring format for `tensara/problems`.

## Goals

- Make problem authoring deterministic for agents.
- Preserve compatibility with existing published problems.
- Separate authoring truth from runtime truth.
- Keep `sync-problems.ts` compatible while stricter validation is rolled out.

## Stable Problem Layout

Each problem lives in:

```text
problems/<slug>/
├── def.py
└── problem.md
```

Both files remain required.

## `problem.md` Contract

### Required Frontmatter

These fields are already expected by sync and remain required:

```yaml
slug: "relu"
title: "ReLU"
difficulty: "EASY"
author: "sarthak"
```

### Recommended Frontmatter

These are now the recommended fields for new problems. They are backward-compatible because unknown frontmatter is ignored by the current sync path.

```yaml
tags: ["activation-function"]
source:
  kind: "kernelbench"
  repo: "ScalingIntelligence/KernelBench"
  path: "KernelBench/level2/63_Gemm_ReLU_Divide.py"
authoring:
  mode: "exact-port" # exact-port | normalized
validation:
  deterministic: true
  sample_path: true
  wrong_answer_rejection: true
  runtime_targets: ["local-cuda", "modal-sample"]
```

### Content Rules

- `slug` must match the directory name.
- `difficulty` must be `EASY`, `MEDIUM`, or `HARD`.
- Markdown body should describe the mathematical contract, not implementation trivia.
- If the problem is adapted from an external source, include attribution in the body or `source` block.
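
The first two rules are mechanical enough to check in a few lines. The following is a minimal sketch, assuming the frontmatter has already been parsed into a dict. `check_frontmatter` and its error strings are hypothetical names used for illustration, not the API of `scripts/validate_problem.py`.

```python
from pathlib import Path

# Values taken from the content rules and required frontmatter above.
ALLOWED_DIFFICULTIES = {"EASY", "MEDIUM", "HARD"}
REQUIRED_FIELDS = ("slug", "title", "difficulty", "author")


def check_frontmatter(problem_dir: Path, frontmatter: dict) -> list[str]:
    """Hypothetical helper: return error strings; an empty list means the rules pass."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in frontmatter:
            errors.append(f"missing required field: {field}")

    slug = frontmatter.get("slug")
    if slug is not None and slug != problem_dir.name:
        errors.append(f"slug {slug!r} does not match directory {problem_dir.name!r}")

    difficulty = frontmatter.get("difficulty")
    if difficulty is not None and difficulty not in ALLOWED_DIFFICULTIES:
        errors.append(f"invalid difficulty: {difficulty!r}")
    return errors
```

Returning a list rather than raising on the first failure lets one run report every problem in a file, which is more useful to an agent attempting a repair.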

## `def.py` Contract

### Required Class Shape

`def.py` must define one primary problem class that subclasses `Problem`.

Canonical class naming:

- directory slug: `gemm-relu-divide`
- class name: `gemm_relu_divide`
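
The single example above suggests that the class name is the slug with hyphens replaced by underscores. A sketch under that assumption follows; the mapping is inferred from this one pair, and `class_name_for_slug` is a hypothetical helper.

```python
def class_name_for_slug(slug: str) -> str:
    """Assumed mapping: hyphens in the directory slug become underscores."""
    return slug.replace("-", "_")
```

For example, `class_name_for_slug("gemm-relu-divide")` yields `gemm_relu_divide`, matching the pair listed above.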

### Required Methods

New problems should implement:

- `reference_solution(self, *args)`
- `generate_test_cases(self)`
- `generate_sample(self)`
- `verify_result(self, expected_output, actual_output)`

Backward-compatible rule:

- existing problems are allowed to keep current behavior if they already work
- new validation treats extra required args on `generate_test_cases`, `generate_sample`, or `verify_result` as contract errors
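
One way to detect extra required arguments without running any problem code is to inspect the method signatures. The sketch below assumes the allowed argument counts can be read off the method list above (none for the two generators, two for `verify_result`). `extra_required_args` is a hypothetical helper and may differ from what the real validator does.

```python
import inspect

# Number of arguments each method may require besides `self`,
# read off the method list above (an assumption of this sketch).
ALLOWED_REQUIRED_ARGS = {
    "generate_test_cases": 0,
    "generate_sample": 0,
    "verify_result": 2,
}


def extra_required_args(cls, method_name: str) -> list[str]:
    """Return names of required parameters beyond what the contract allows."""
    method = getattr(cls, method_name)
    # Looked up on the class, the method is a plain function whose
    # first parameter is `self`; drop it before counting.
    params = list(inspect.signature(method).parameters.values())[1:]
    required = [
        p.name
        for p in params
        if p.default is inspect.Parameter.empty
        and p.kind not in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    ]
    return required[ALLOWED_REQUIRED_ARGS[method_name]:]
```

A non-empty return value names the offending parameters, so a diagnostic can say exactly which argument needs a default or should be removed.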

### Parameters

Preferred:

- define `parameters = [...]` in `def.py`

Fallback still supported:

- override `get_function_signature(...)`
- include legacy `parameters` frontmatter in `problem.md`

New problems should define parameters in `def.py`, because that is the most machine-readable source for agents and validators.

### Test Case Shape

`generate_test_cases()` should return a list of dicts.

Each dict should contain:

- `name`: stable string label
- `create_inputs`: zero-arg callable returning the reference inputs

Additional keys such as dimensions, seed, or descriptive metadata are encouraged.
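
As an illustration of this shape, here is a minimal sketch written as a free function that uses plain Python lists in place of tensors. The sizes, seeds, and the `make_case` helper are placeholders; a real `def.py` would define this as a method and would likely derive seeds via `Problem.get_seed(...)`, whose signature is not shown in this document.

```python
import random


def make_case(name: str, n: int, seed: int) -> dict:
    """Build one test case dict; the inputs are created lazily and reproducibly."""
    def create_inputs():
        rng = random.Random(seed)  # fresh generator per call keeps results stable
        return ([rng.uniform(-1.0, 1.0) for _ in range(n)],)

    # `name` and `create_inputs` are the required keys; `n` and `seed` are metadata.
    return {"name": name, "n": n, "seed": seed, "create_inputs": create_inputs}


def generate_test_cases() -> list[dict]:
    sizes = (16, 1024, 65536)  # placeholder sizes
    return [make_case(f"n={n}", n, seed=1000 + i) for i, n in enumerate(sizes)]
```

Building each dict in a helper function binds `n` and `seed` per case. Lambdas defined directly in a loop would all capture the last loop value.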

### Sample Shape

Canonical form:

- `generate_sample()` returns a single dict with the same shape as one test case

Backward-compatible form still accepted by the validator:

- a list containing exactly one dict
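
A consumer can accept both forms by normalizing to the canonical one. The sketch below uses a hypothetical `normalize_sample` helper, which is not necessarily how the validator handles it.

```python
def normalize_sample(sample):
    """Return the sample as a single dict, accepting the legacy one-element list."""
    if isinstance(sample, dict):
        return sample
    if isinstance(sample, list) and len(sample) == 1 and isinstance(sample[0], dict):
        return sample[0]
    raise ValueError("generate_sample() must return a dict or a list containing exactly one dict")
```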

### Verifier Rules

`verify_result(...)` should:

- accept the exact reference output
- reject intentionally perturbed outputs
- return `(bool, debug_info)`
- provide debug info that is useful for automated repair
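
These rules can be sketched as an elementwise tolerance check, written here as a free function. Plain lists of floats stand in for tensors, and the tolerances and debug-info keys are illustrative assumptions rather than values taken from the repository.

```python
import math


def verify_result(expected_output, actual_output, rtol=1e-4, atol=1e-6):
    """Return (ok, debug_info); debug_info pinpoints the first mismatch."""
    if len(expected_output) != len(actual_output):
        return False, {
            "message": "length mismatch",
            "expected_len": len(expected_output),
            "actual_len": len(actual_output),
        }
    for i, (e, a) in enumerate(zip(expected_output, actual_output)):
        # math.isclose is False when either value is NaN, so NaN outputs are rejected.
        if not math.isclose(e, a, rel_tol=rtol, abs_tol=atol):
            return False, {
                "message": "value mismatch",
                "index": i,
                "expected": e,
                "actual": a,
                "abs_diff": abs(e - a),
            }
    return True, {}
```

A quick self-check is to pass the reference output unchanged, which must be accepted, and then a copy with one element shifted by more than the tolerance, which must be rejected with the index of the changed element.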

## Canonical Authoring Pattern

1. Define the mathematical contract in `problem.md`.
2. Define parameters in `def.py`.
3. Make `reference_solution(...)` deterministic.
4. Use `Problem.get_seed(...)` for seeded test generation.
5. Add one small sample case.
6. Add several larger generated cases.
7. Make verifier failures explicit and debuggable.

## Compatibility Policy

This contract is intentionally additive:

- existing frontmatter fields keep working
- old problems are not forced to adopt new optional metadata immediately
- validation distinguishes structural errors from migration warnings

The long-term direction is to move all new problems toward this contract and then tighten sync around it.
78 changes: 78 additions & 0 deletions docs/problem-automation-roadmap.md
@@ -0,0 +1,78 @@
# Problem Automation Roadmap

This roadmap defines how `tensara/problems` becomes agent-friendly without breaking current workflows.

## Immediate Direction

The first priority is not contests. It is deterministic ingestion.

That means:

1. agents can author to a stable contract
2. validators can reject weak or broken problems automatically
3. accepted problems are safe to sync and publish

## Phase 1: Contract First

Ship:

- a stable authoring contract
- a stable validation contract
- a machine-readable validator
- structural CI on every PR
- templates for new problems

This PR implements Phase 1.

## Phase 2: Stronger Runtime Validation

Add:

- routine H100 validation as an inexpensive local CUDA check
- a Modal-backed product-runtime check as the final acceptance gate
- persisted validation artifacts that agents can inspect

## Phase 3: Verifier Strength

Add:

- adversarial wrong-answer checks
- mutation tests for common failure modes
- problem-family-specific negative cases

This is how testcase quality becomes measurable instead of subjective.

## Phase 4: Sync Hardening

Tighten `sync-problems.ts` so sync fails on structural contract violations instead of only warning.

Examples:

- missing parameters
- bad method signatures
- slug/frontmatter mismatch

## Phase 5: Automated Growth

Once the contract and validators are stable:

- agents can open daily problem PRs
- CI can auto-classify failures
- maintainers can review mostly by exception
- accepted problems can auto-sync to production

## Phase 6: Contest Automation

Only after validation is trusted:

- assemble contest sets automatically
- require hidden-test quality and difficulty spread
- publish and schedule through the same validated pipeline

## Non-Goals For This Phase

- rewriting all existing problems
- forcing immediate frontmatter migration
- making Modal validation mandatory in this repo before the surrounding auth/runtime glue is ready

The immediate goal is a reliable contract and validator surface that later Modal automation can plug into.
107 changes: 107 additions & 0 deletions docs/problem-validation-contract.md
@@ -0,0 +1,107 @@
# Problem Validation Contract

This document defines the validation ladder for `tensara/problems`.

## Validation Tiers

### Tier 1: Structural Validation

Runs in ordinary CI without GPUs.

Checks:

- required files exist
- frontmatter has required fields
- slug matches directory
- `def.py` has a `Problem` subclass
- required methods exist
- method signatures match the stable contract
- parameters are present in `def.py` or `get_function_signature(...)` is overridden

This tier should run on every PR.
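
Because this tier has no GPU, and importing `def.py` may pull in CUDA-dependent packages, one option is to inspect the file statically. The sketch below uses the standard-library `ast` module to find classes that list `Problem` as a base. `problem_subclasses` is a hypothetical helper and not necessarily how `scripts/validate_problem.py` is implemented.

```python
import ast


def problem_subclasses(source: str) -> list[str]:
    """Return names of top-level classes that list `Problem` as a direct base."""
    names = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        for base in node.bases:
            # Handles both `class X(Problem)` and `class X(module.Problem)`.
            if isinstance(base, ast.Name):
                base_name = base.id
            elif isinstance(base, ast.Attribute):
                base_name = base.attr
            else:
                base_name = None
            if base_name == "Problem":
                names.append(node.name)
                break
    return names
```

The check matches on the base-class name only, so it does not follow aliases or indirect inheritance. That is a deliberate simplification for a structural gate.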

### Tier 2: Local CUDA Validation

Runs on a real GPU such as Together H100.

Checks:

- `generate_sample()` executes
- `generate_test_cases()` executes
- `reference_solution(...)` runs on CUDA
- `verify_result(...)` accepts correct outputs
- verifier rejects perturbed outputs when enabled
- `get_flops(...)` is positive when provided

This tier is the fast author-side correctness gate.

### Tier 3: Product Runtime Validation

Runs through the same runtime path as the real Tensara product.

Authoritative target:

- Modal-backed sample/checker endpoints from `tensara/tensara`

Why this is authoritative:

- it exercises the same engine loading path used by the product
- it catches signature, allocation, and runner mismatches that local file execution can miss

This is the final acceptance gate for automation.

## Source of Truth

Validation truth should be ordered from least to most authoritative:

1. structural CI
2. local CUDA runtime
3. Modal/product runtime

Local H100 success is necessary but not sufficient. Modal runtime is the product-truth layer.

## Standard Validator Output

Validators should emit machine-readable results:

```json
{
  "ok": true,
  "summary": {
    "problems_checked": 1,
    "errors": 0,
    "warnings": 1
  },
  "results": [
    {
      "slug": "relu",
      "ok": true,
      "diagnostics": [
        {"level": "warning", "code": "missing_source_metadata", "message": "Recommended metadata not present"}
      ]
    }
  ]
}
```

That output shape is chosen so agents can repair failures automatically.
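
As a sketch of how an agent might consume this output, the function below collects every `error`-level diagnostic along with the slug it belongs to. It assumes only the field names shown in the JSON above; `blocking_diagnostics` is a hypothetical helper.

```python
import json


def blocking_diagnostics(report_text: str) -> list[tuple[str, str, str]]:
    """Return (slug, code, message) for each error-level diagnostic in a report."""
    report = json.loads(report_text)
    found = []
    for result in report.get("results", []):
        for diag in result.get("diagnostics", []):
            if diag.get("level") == "error":
                found.append((result["slug"], diag["code"], diag["message"]))
    return found
```

Keying each entry by slug and code gives a repair loop a stable handle on which problem to edit and which rule it violated.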

## Required Checks For New Problems

New problems should pass:

- structural validation
- sample execution
- at least one generated test case
- wrong-answer rejection
- Modal sample or checker validation before merge

## Migration Policy

Backward compatibility matters, so the validator distinguishes:

- `error`: merge blocker
- `warning`: migration or quality issue
- `info`: useful metadata only
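
Under this policy only `error` diagnostics block a merge. The sketch below aggregates per-problem results into a top-level `ok` and summary, assuming `ok` means zero errors. That reading is consistent with the sample output, where `ok` is true despite one warning, but it is not confirmed by the validator source. `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(results: list[dict]) -> dict:
    """Aggregate diagnostics by level; only errors make the run fail."""
    levels = Counter(
        diag["level"]
        for result in results
        for diag in result.get("diagnostics", [])
    )
    return {
        "ok": levels["error"] == 0,  # assumption: warnings and info never block
        "summary": {
            "problems_checked": len(results),
            "errors": levels["error"],
            "warnings": levels["warning"],
        },
    }
```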

Existing published problems should initially be brought under structural validation first. Stronger runtime requirements can then tighten in phases.