tensara · saarang123 · Mar 31, 2026
diff --git a/.github/workflows/validate-problems.yml b/.github/workflows/validate-problems.yml
@@ -0,0 +1,20 @@
+name: Validate Problems
+
+on:
+  pull_request:
+  push:
+    branches: ["main"]
+
+jobs:
+  contract:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Structural validation
+        run: python scripts/validate_problem.py --runtime none --format text
diff --git a/README.md b/README.md
@@ -41,3 +41,30 @@ The `problem.md` file should contain a description of the problem written in Mar
   - `const`: boolean
 
 Once you add a problem, make sure to test both correct (slow/fast) and incorrect submissions. Let us know if you encounter any issues/bugs!
+
+## Contract And Validation
+
+For new work, use the contract-first docs in this repo:
+
+- `docs/problem-authoring-contract.md`
+- `docs/problem-validation-contract.md`
+- `docs/problem-automation-roadmap.md`
+
+There is now a validator with machine-readable output:
+
+```bash
+python scripts/validate_problem.py --runtime none
+python scripts/validate_problem.py relu softmax --runtime first
+python scripts/validate_problem.py gemm-relu-divide --runtime all --enforce-wrong-answer-rejection
+```
+
+Recommended workflow:
+
+1. pass structural validation in CI
+2. validate locally on a CUDA GPU, such as a Together H100
+3. validate through the Modal-backed product runtime before merge when the runtime path matters
+
+Templates for new problems live in:
+
+- `templates/problem-template.def.py`
+- `templates/problem-template.md`
diff --git a/docs/problem-authoring-contract.md b/docs/problem-authoring-contract.md
@@ -0,0 +1,149 @@
+# Problem Authoring Contract
+
+This document defines the backward-compatible authoring format for `tensara/problems`.
+
+## Goals
+
+- Make problem authoring deterministic for agents.
+- Preserve compatibility with existing published problems.
+- Separate authoring truth from runtime truth.
+- Keep `sync-problems.ts` compatible while stricter validation is rolled out.
+
+## Stable Problem Layout
+
+Each problem lives in:
+
+```text
+problems/<slug>/
+├── def.py
+└── problem.md
+```
+
+Both files remain required.
+
+## `problem.md` Contract
+
+### Required Frontmatter
+
+These fields are already expected by sync and remain required:
+
+```yaml
+slug: "relu"
+title: "ReLU"
+difficulty: "EASY"
+author: "sarthak"
+```
+
+### Recommended Frontmatter
+
+These are now the recommended fields for new problems. They are backward-compatible because unknown frontmatter is ignored by the current sync path.
+
+```yaml
+tags: ["activation-function"]
+source:
+  kind: "kernelbench"
+  repo: "ScalingIntelligence/KernelBench"
+  path: "KernelBench/level2/63_Gemm_ReLU_Divide.py"
+authoring:
+  mode: "exact-port"   # exact-port | normalized
+validation:
+  deterministic: true
+  sample_path: true
+  wrong_answer_rejection: true
+  runtime_targets: ["local-cuda", "modal-sample"]
+```
+
+### Content Rules
+
+- `slug` must match the directory name.
+- `difficulty` must be `EASY`, `MEDIUM`, or `HARD`.
+- Markdown body should describe the mathematical contract, not implementation trivia.
+- If the problem is adapted from an external source, include attribution in the body or `source` block.
+
+## `def.py` Contract
+
+### Required Class Shape
+
+`def.py` must define one primary problem class that subclasses `Problem`.
+
+Canonical class naming:
+
+- directory slug: `gemm-relu-divide`
+- class name: `gemm_relu_divide`
+
+### Required Methods
+
+New problems should implement:
+
+- `reference_solution(self, *args)`
+- `generate_test_cases(self)`
+- `generate_sample(self)`
+- `verify_result(self, expected_output, actual_output)`
+
+Backward-compatible rule:
+
+- existing problems are allowed to keep current behavior if they already work
+- new validation treats extra required args on `generate_test_cases`, `generate_sample`, or `verify_result` as contract errors
+
+### Parameters
+
+Preferred:
+
+- define `parameters = [...]` in `def.py`
+
+Fallback still supported:
+
+- override `get_function_signature(...)`
+- include legacy `parameters` frontmatter in `problem.md`
+
+New problems should define parameters in `def.py`, because that is the most machine-readable source for agents and validators.
+
+### Test Case Shape
+
+`generate_test_cases()` should return a list of dicts.
+
+Each dict should contain:
+
+- `name`: stable string label
+- `create_inputs`: zero-arg callable returning the reference inputs
+
+Additional keys such as dimensions, seed, or descriptive metadata are encouraged.
+
+### Sample Shape
+
+Canonical form:
+
+- `generate_sample()` returns a single dict with the same shape as one test case
+
+Backward-compatible form still accepted by the validator:
+
+- a list containing exactly one dict
+
+### Verifier Rules
+
+`verify_result(...)` should:
+
+- accept the exact reference output
+- reject intentionally perturbed outputs
+- return `(bool, debug_info)`
+- provide debug info that is useful for automated repair
+
+## Canonical Authoring Pattern
+
+1. Define the mathematical contract in `problem.md`.
+2. Define parameters in `def.py`.
+3. Make `reference_solution(...)` deterministic.
+4. Use `Problem.get_seed(...)` for seeded test generation.
+5. Add one small sample case.
+6. Add several larger generated cases.
+7. Make verifier failures explicit and debuggable.
+
+## Compatibility Policy
+
+This contract is intentionally additive:
+
+- existing frontmatter fields keep working
+- old problems are not forced to adopt new optional metadata immediately
+- validation distinguishes structural errors from migration warnings
+
+The long-term direction is to move all new problems toward this contract and then tighten sync around it.
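
Note on `docs/problem-authoring-contract.md`: the Test Case Shape, Sample Shape, and Verifier Rules sections can be illustrated with a minimal sketch. This is not code from the repo. The `Problem` base class below is a stand-in, the `get_seed` body is hypothetical (the real signature may differ), plain Python lists stand in for tensors, and `parameters` is omitted because its field schema is not shown in this diff.

```python
import random


class Problem:
    """Stand-in for the repo's real base class (assumption, not imported)."""

    @staticmethod
    def get_seed(label: str) -> int:
        # Hypothetical seeding scheme; the real Problem.get_seed may differ.
        return sum(label.encode("utf-8"))


class relu(Problem):
    """Illustrative problem for slug `relu`; lists stand in for tensors."""

    def reference_solution(self, values):
        return [max(0.0, v) for v in values]

    def _make_case(self, name, size):
        seed = Problem.get_seed(name)

        def create_inputs():
            # Re-seeded on every call, so repeated calls give identical inputs.
            rng = random.Random(seed)
            return ([rng.uniform(-1.0, 1.0) for _ in range(size)],)

        # `name` and `create_inputs` are the contract keys; the rest is metadata.
        return {"name": name, "size": size, "seed": seed,
                "create_inputs": create_inputs}

    def generate_test_cases(self):
        return [self._make_case(f"n_{n}", n) for n in (1024, 4096)]

    def generate_sample(self):
        # Canonical form: a single dict shaped like one test case.
        return self._make_case("sample", 8)

    def verify_result(self, expected_output, actual_output):
        if len(expected_output) != len(actual_output):
            return False, {"reason": "length_mismatch",
                           "expected": len(expected_output),
                           "actual": len(actual_output)}
        for i, (e, a) in enumerate(zip(expected_output, actual_output)):
            # `not (... <= tol)` also rejects NaN, which `... > tol` would accept.
            if not abs(e - a) <= 1e-6:
                return False, {"reason": "value_mismatch", "index": i,
                               "expected": e, "actual": a}
        return True, {}


problem = relu()
sample = problem.generate_sample()
inputs = sample["create_inputs"]()
expected = problem.reference_solution(*inputs)

ok, debug = problem.verify_result(expected, expected)
perturbed = [v + 1.0 for v in expected]
rejected_ok, rejected_debug = problem.verify_result(expected, perturbed)
```

The verifier accepts the exact reference output and rejects the perturbed one, and the debug dict names the first failing index so that an automated repair step has something concrete to act on.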
diff --git a/docs/problem-automation-roadmap.md b/docs/problem-automation-roadmap.md
@@ -0,0 +1,78 @@
+# Problem Automation Roadmap
+
+This roadmap defines how `tensara/problems` becomes agent-friendly without breaking current workflows.
+
+## Immediate Direction
+
+The first priority is not contests. It is deterministic ingestion.
+
+That means:
+
+1. agents can author to a stable contract
+2. validators can reject weak or broken problems automatically
+3. accepted problems are safe to sync and publish
+
+## Phase 1: Contract First
+
+Ship:
+
+- a stable authoring contract
+- a stable validation contract
+- a machine-readable validator
+- structural CI on every PR
+- templates for new problems
+
+This PR implements Phase 1.
+
+## Phase 2: Stronger Runtime Validation
+
+Add:
+
+- routine H100 validation as a cheap local CUDA correctness check
+- a Modal-backed product-runtime check as the final acceptance gate
+- persisted validation artifacts that agents can inspect
+
+## Phase 3: Verifier Strength
+
+Add:
+
+- adversarial wrong-answer checks
+- mutation tests for common failure modes
+- problem-family-specific negative cases
+
+This is how test-case quality becomes measurable instead of subjective.
+
+## Phase 4: Sync Hardening
+
+Tighten `sync-problems.ts` so sync fails on structural contract violations instead of only warning.
+
+Examples:
+
+- missing parameters
+- bad method signatures
+- slug/frontmatter mismatch
+
+## Phase 5: Automated Growth
+
+Once the contract and validators are stable:
+
+- agents can open daily problem PRs
+- CI can auto-classify failures
+- maintainers can review mostly by exception
+- accepted problems can auto-sync to production
+
+## Phase 6: Contest Automation
+
+Only after validation is trusted:
+
+- assemble contest sets automatically
+- require hidden-test quality and difficulty spread
+- publish and schedule through the same validated pipeline
+
+## Non-Goals For This Phase
+
+- rewriting all existing problems
+- forcing immediate frontmatter migration
+- making Modal validation mandatory in this repo before the surrounding auth/runtime glue is ready
+
+The immediate goal is a reliable contract and validator surface that later Modal automation can plug into.
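
Note on Phase 3 of `docs/problem-automation-roadmap.md`: one way to make verifier strength measurable is to count how many deliberately wrong outputs a verifier accepts. The sketch below is an assumption about how such mutation tests could look, not the roadmap's implementation. The mutation names, both verifiers, and the list-based outputs are illustrative.

```python
import math


def tolerance_verifier(expected_output, actual_output, tol=1e-6):
    """Hypothetical verifier returning (bool, debug_info)."""
    if len(expected_output) != len(actual_output):
        return False, {"reason": "length_mismatch"}
    for i, (e, a) in enumerate(zip(expected_output, actual_output)):
        if not abs(e - a) <= tol:  # written this way so NaN is rejected
            return False, {"reason": "value_mismatch", "index": i}
    return True, {}


def naive_verifier(expected_output, actual_output):
    """Deliberately weak verifier that only compares lengths."""
    return len(expected_output) == len(actual_output), {}


# Each mutation models a common failure mode of a wrong submission.
MUTATIONS = {
    "all_zeros": lambda xs: [0.0 for _ in xs],
    "scaled": lambda xs: [2.0 * x for x in xs],
    "off_by_one_element": lambda xs: xs[:-1] + [xs[-1] + 1.0],
    "nan_injected": lambda xs: [math.nan] + xs[1:],
    "truncated": lambda xs: xs[:-1],
}


def surviving_mutations(verify, expected):
    """Return the names of mutations that the verifier wrongly accepts."""
    survivors = []
    for name, mutate in MUTATIONS.items():
        mutated = mutate(list(expected))
        if mutated == expected:
            continue  # the mutation changed nothing, so it proves nothing
        accepted, _ = verify(expected, mutated)
        if accepted:
            survivors.append(name)
    return survivors


expected = [0.5, 1.5, 0.0, 2.0]
strong = surviving_mutations(tolerance_verifier, expected)
weak = surviving_mutations(naive_verifier, expected)
```

An empty survivor list means the verifier rejected every mutation. A non-empty list names the failure modes that would slip through, which gives a concrete, repeatable quality signal per problem.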
diff --git a/docs/problem-validation-contract.md b/docs/problem-validation-contract.md
@@ -0,0 +1,107 @@
+# Problem Validation Contract
+
+This document defines the validation ladder for `tensara/problems`.
+
+## Validation Tiers
+
+### Tier 1: Structural Validation
+
+Runs in ordinary CI without GPUs.
+
+Checks:
+
+- required files exist
+- frontmatter has required fields
+- slug matches directory
+- `def.py` has a `Problem` subclass
+- required methods exist
+- method signatures match the stable contract
+- parameters are present in `def.py` or `get_function_signature(...)` is overridden
+
+This tier should run on every PR.
+
+### Tier 2: Local CUDA Validation
+
+Runs on a real GPU such as Together H100.
+
+Checks:
+
+- `generate_sample()` executes
+- `generate_test_cases()` executes
+- `reference_solution(...)` runs on CUDA
+- `verify_result(...)` accepts correct outputs
+- verifier rejects perturbed outputs when enabled
+- `get_flops(...)` is positive when provided
+
+This tier is the fast author-side correctness gate.
+
+### Tier 3: Product Runtime Validation
+
+Runs through the same runtime path as the real Tensara product.
+
+Authoritative target:
+
+- Modal-backed sample/checker endpoints from `tensara/tensara`
+
+Why this is authoritative:
+
+- it exercises the same engine loading path used by the product
+- it catches signature, allocation, and runner mismatches that local file execution can miss
+
+This is the final acceptance gate for automation.
+
+## Source of Truth
+
+Validation truth should be ordered from least to most authoritative:
+
+1. structural CI
+2. local CUDA runtime
+3. Modal/product runtime
+
+Local H100 success is necessary but not sufficient. Modal runtime is the product-truth layer.
+
+## Standard Validator Output
+
+Validators should emit machine-readable results:
+
+```json
+{
+  "ok": true,
+  "summary": {
+    "problems_checked": 1,
+    "errors": 0,
+    "warnings": 1
+  },
+  "results": [
+    {
+      "slug": "relu",
+      "ok": true,
+      "diagnostics": [
+        {"level": "warning", "code": "missing_source_metadata", "message": "Recommended metadata not present"}
+      ]
+    }
+  ]
+}
+```
+
+That output shape is chosen so agents can repair failures automatically.
+
+## Required Checks For New Problems
+
+New problems should pass:
+
+- structural validation
+- sample execution
+- at least one generated test case
+- wrong-answer rejection
+- Modal sample or checker validation before merge
+
+## Migration Policy
+
+Backward compatibility matters, so the validator distinguishes:
+
+- `error`: merge blocker
+- `warning`: migration or quality issue
+- `info`: useful metadata only
+
+Existing published problems should be brought under structural validation first. Stronger runtime requirements can then be tightened in phases.
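
Note on `docs/problem-validation-contract.md`: the Tier 1 signature check and the Standard Validator Output shape can be sketched together. This is a hedged illustration, not the code in `scripts/validate_problem.py`. The diagnostic codes `missing_method` and `extra_required_args`, the allowed-argument table, and the example classes are all assumptions.

```python
import inspect
import json

# Assumed limits on required args (excluding `self`) for each contract method.
ALLOWED_REQUIRED_ARGS = {
    "generate_test_cases": 0,
    "generate_sample": 0,
    "verify_result": 2,  # expected_output, actual_output
}


def required_arg_count(func):
    """Count parameters without defaults, excluding `self`, *args, and **kwargs."""
    params = list(inspect.signature(func).parameters.values())[1:]  # drop self
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return sum(
        1 for p in params
        if p.default is inspect.Parameter.empty and p.kind not in variadic
    )


def check_problem_class(slug, cls):
    """Run the signature check and return one entry for `results`."""
    diagnostics = []
    for name, allowed in ALLOWED_REQUIRED_ARGS.items():
        method = getattr(cls, name, None)
        if method is None:
            diagnostics.append({"level": "error", "code": "missing_method",
                                "message": f"{name} is not defined"})
        elif required_arg_count(method) > allowed:
            diagnostics.append({"level": "error", "code": "extra_required_args",
                                "message": f"{name} takes more than {allowed} required args"})
    ok = not any(d["level"] == "error" for d in diagnostics)
    return {"slug": slug, "ok": ok, "diagnostics": diagnostics}


def build_report(results):
    """Assemble the top-level report in the documented output shape."""
    levels = [d["level"] for r in results for d in r["diagnostics"]]
    return {
        "ok": all(r["ok"] for r in results),
        "summary": {
            "problems_checked": len(results),
            "errors": levels.count("error"),
            "warnings": levels.count("warning"),
        },
        "results": results,
    }


class good:  # illustrative class that satisfies the assumed signatures
    def generate_test_cases(self): return []
    def generate_sample(self): return {}
    def verify_result(self, expected_output, actual_output): return True, {}


class bad(good):  # adds a required arg, which the check flags as an error
    def generate_sample(self, batch_size): return {}


report = build_report([check_problem_class("good", good),
                       check_problem_class("bad", bad)])
print(json.dumps(report, indent=2))
```

Because only `error` diagnostics flip `ok` to false, `warning` and `info` entries can be added for migration issues without blocking a merge, which matches the Migration Policy levels.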