sota-verify

A 6-test verification suite for scoring formulas and evaluation metrics. Given a formula that maps empirical signals to a score, sota-verify checks whether that formula is fit for purpose: robust to noise, well-behaved (correct gradient signs, hard gates, and axioms), and better than trivial baselines.

Based on the k-server-bench evaluation methodology.

The 6 Tests

| # | Test | What it checks | Pass criterion |
|---|------|----------------|----------------|
| 1 | Combined Score | Product of scores across all instances correlates with ground-truth ranking | Spearman ρ > 0.5 |
| 2 | Robustness | Gaussian noise at 4 levels doesn't flip the formula's ranking | Sign-flip rate < 5% at all noise levels |
| 3 | Gradient / Monotonicity | Partial derivatives have the correct sign; hard gates fire correctly | 100% of checks pass |
| 4 | Baseline Comparison | Formula beats random and reversed baselines on every instance | Beats both on all metrics |
| 5 | Axiom Checks | Maximum at expected point, hard gates at boundary, monotonicity | All axioms satisfied |
| 6 | Cross-Instance Transfer | Formula trained on easy instances predicts hard-instance quality | Beats raw quality in >30% of pairs |
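
To make the Robustness row concrete, here is a minimal sketch of one way a sign-flip rate could be measured. It is not the package's implementation: the pairwise definition of a "flip", the helper name `sign_flip_rate`, the noise level used at the end, and the example candidates are all assumptions made for illustration. The formula is the Quick Start example, repeated so the block runs on its own.

```python
import itertools
import numpy as np

def formula(E, S, U, C, R, M):
    # Quick Start example formula, copied here so the sketch is self-contained.
    if E < 0.5:
        return 0.0
    sigma = C / (1 + C)
    return (E * S * U * sigma) / (R * (1 + M))

def sign_flip_rate(candidates, noise_std, n_perturbations=50, seed=0):
    """Fraction of candidate pairs whose score ordering flips under Gaussian noise.

    Assumed definition: a flip is a change in sign(score_i - score_j).
    """
    rng = np.random.default_rng(seed)
    keys = ("E", "S", "U", "C", "R")  # continuous signals; M is left unperturbed
    base = [formula(**c) for c in candidates]
    flips = total = 0
    for _ in range(n_perturbations):
        noisy = []
        for c in candidates:
            p = dict(c)
            for k in keys:
                # Floor at a small positive value so R never reaches zero.
                p[k] = max(1e-6, c[k] + rng.normal(0.0, noise_std))
            noisy.append(formula(**p))
        for i, j in itertools.combinations(range(len(candidates)), 2):
            before = np.sign(base[i] - base[j])
            after = np.sign(noisy[i] - noisy[j])
            if before != 0:  # ties in the unperturbed ranking are skipped
                total += 1
                flips += int(after != before)
    return flips / total if total else 0.0

# Hypothetical candidates, not taken from any benchmark.
example_candidates = [
    {"E": 0.9, "S": 0.8, "U": 1.5, "C": 0.6, "R": 0.1, "M": 1},
    {"E": 0.6, "S": 0.3, "U": 0.5, "C": 0.2, "R": 0.5, "M": 3},
]
rate = sign_flip_rate(example_candidates, noise_std=0.05)
print(f"sign-flip rate at noise_std=0.05: {rate:.3f}")
```

Under this definition, the 5% criterion in the table would correspond to `rate < 0.05` at each of the four noise levels the suite uses.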

Verdict Scale

| Tests Passed | Verdict |
|--------------|---------|
| 6/6 | SOTA: formula passes all verification |
| 5/6 | Near-SOTA: minor weakness |
| 4/6 | Promising: needs work |
| 3/6 | Framework: correct structure, poor performance |
| < 3/6 | Not viable: fundamental issues |
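
The scale above is a simple lookup on the number of passed tests. As a rough illustration (the function name `verdict` is hypothetical and not part of the documented API), it could be written as:

```python
def verdict(n_passed: int) -> str:
    """Map the number of passed tests (0-6) to the verdict label from the scale."""
    if not 0 <= n_passed <= 6:
        raise ValueError("n_passed must be between 0 and 6")
    labels = {6: "SOTA", 5: "Near-SOTA", 4: "Promising", 3: "Framework"}
    # Anything below 3 passed tests falls into the last row of the scale.
    return labels.get(n_passed, "Not viable")

for n in (6, 4, 2):
    print(f"{n}/6 -> {verdict(n)}")
```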

Quick Start

Install:

pip install sota-verify

Define your formula and data in a Python file:

# my_formula.py
import numpy as np

def formula(E, S, U, C, R, M):
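    """Your scoring function. E=quality, S=surprise, U=coverage, C=challenge, R=risk, M=complexity."""
    # Hard gate: candidates below the quality threshold score zero.
    if E < 0.5:
        return 0.0
    # Saturating transform of challenge: increasing in C, bounded below 1.
    sigma = C / (1 + C)
    # Assumes R > 0; R == 0 raises ZeroDivisionError.
    return (E * S * U * sigma) / (R * (1 + M))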

def load_data():
    """Return {per_metric: {...}, cross_instance: [...]}."""
    # ... load your benchmark data ...
    return data

Run the suite:

sota-verify my_formula.py

Or use it programmatically:

from sota_verify import run_all
import my_formula

results = run_all(my_formula.load_data(), my_formula.formula)

Data Format

Your load_data() function must return:

{
    "per_metric": {
        "metric_name_1": [
            {
                "name": "candidate_a",
                "E": 0.82,       # empirical quality [0, 1]
                "S": 0.64,       # surprise [0, 1]
                "U": 1.3,        # hard-case coverage [0, inf)
                "C": 0.45,       # challenge magnitude [0, 1]
                "R": 0.12,       # reproducibility risk [0, 1]
                "M": 2,          # complexity (int, >= 0)
                "B_formula": 0.0  # pre-computed formula output
            },
            # ... more candidates
        ],
        "metric_name_2": [ ... ],
    },
    "cross_instance": [
        {"rho_E": 0.45, "rho_formula": 0.62},   # correlation on a transfer pair
        # ... more pairs
    ]
}
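
Before running the suite, it can help to check a data dict against the ranges documented above. The following is a sketch only: `candidate_problems` and `data_problems` are hypothetical helpers, not functions provided by sota-verify, and the package may apply different or additional checks.

```python
UNIT_KEYS = ("E", "S", "C", "R")  # fields documented with range [0, 1]
REQUIRED = ("name", "E", "S", "U", "C", "R", "M", "B_formula")

def candidate_problems(c):
    """Return a list of human-readable problems for one candidate dict."""
    problems = [f"missing key: {k}" for k in REQUIRED if k not in c]
    if problems:
        return problems
    for k in UNIT_KEYS:
        if not 0.0 <= c[k] <= 1.0:
            problems.append(f"{k}={c[k]} outside [0, 1]")
    if c["U"] < 0:
        problems.append(f"U={c['U']} is negative")
    if not isinstance(c["M"], int) or c["M"] < 0:
        problems.append(f"M={c['M']!r} is not a non-negative int")
    return problems

def data_problems(data):
    """Check the top-level layout, every candidate, and every transfer pair."""
    problems = [f"missing top-level key: {k}"
                for k in ("per_metric", "cross_instance") if k not in data]
    for metric, candidates in data.get("per_metric", {}).items():
        for c in candidates:
            label = f"{metric}/{c.get('name', '?')}"
            problems += [f"{label}: {p}" for p in candidate_problems(c)]
    for i, pair in enumerate(data.get("cross_instance", [])):
        for k in ("rho_E", "rho_formula"):
            if k not in pair:
                problems.append(f"cross_instance[{i}]: missing key: {k}")
    return problems

# The example values from the format description above.
example_data = {
    "per_metric": {
        "metric_name_1": [
            {"name": "candidate_a", "E": 0.82, "S": 0.64, "U": 1.3,
             "C": 0.45, "R": 0.12, "M": 2, "B_formula": 0.0},
        ],
    },
    "cross_instance": [{"rho_E": 0.45, "rho_formula": 0.62}],
}
print(data_problems(example_data) or "no problems found")
```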

API Reference

`run_all(data, formula_fn, hard_gate_threshold=0.5)`

Run all 6 tests. Returns a list of result dicts, each with "test", "passed", and test-specific fields.
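
Since each result dict carries "test" and "passed", a pass count can be derived directly from the returned list. The sketch below runs on a hand-written stand-in list so that it works without the package installed; the test-name strings and the `summarize` helper are hypothetical, and real results also carry test-specific fields that are not shown here.

```python
def summarize(results):
    """Return (number passed, printable report) for a list of result dicts."""
    n_passed = sum(1 for r in results if r["passed"])
    lines = [f"{'PASS' if r['passed'] else 'FAIL'}  {r['test']}" for r in results]
    lines.append(f"{n_passed}/{len(results)} tests passed")
    return n_passed, "\n".join(lines)

# Stand-in for the list returned by run_all(); names are illustrative only.
sample_results = [
    {"test": "combined_score", "passed": True},
    {"test": "robustness", "passed": True},
    {"test": "gradients", "passed": True},
    {"test": "baseline_comparison", "passed": False},
    {"test": "axioms", "passed": True},
    {"test": "cross_instance", "passed": True},
]
n_passed, report = summarize(sample_results)
print(report)
```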

Individual Tests

Each test can be called independently:

test_combined_score(data, formula_key="B_formula")
test_robustness(data, formula_fn, formula_key="B_formula", n_perturbations=50)
test_gradients(formula_fn, hard_gate_threshold=0.5)
test_baseline_comparison(data, formula_key="B_formula")
test_axioms(formula_fn, hard_gate_threshold=0.5)
test_cross_instance(data, formula_key="rho_formula", e_key="rho_E")
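
To illustrate the kind of check `test_gradients` describes, here is a minimal finite-difference sketch. It is an assumption about the approach, not the package's code: the expected signs (positive in E, S, U, C; negative in R, M), the evaluation point, and the helper `partial_sign` are all chosen for illustration, using the Quick Start formula.

```python
def formula(E, S, U, C, R, M):
    # Quick Start example formula, copied here so the sketch is self-contained.
    if E < 0.5:
        return 0.0
    sigma = C / (1 + C)
    return (E * S * U * sigma) / (R * (1 + M))

def partial_sign(fn, point, key, h=1e-4):
    """Sign (+1, 0, -1) of the central finite difference of fn along one input."""
    up = dict(point, **{key: point[key] + h})
    down = dict(point, **{key: point[key] - h})
    diff = fn(**up) - fn(**down)
    return (diff > 0) - (diff < 0)

# Assumed expected signs: reward signals increase the score, penalties decrease it.
EXPECTED = {"E": 1, "S": 1, "U": 1, "C": 1, "R": -1, "M": -1}

# Hypothetical evaluation point, chosen above the hard gate (E >= 0.5).
point = {"E": 0.8, "S": 0.6, "U": 1.0, "C": 0.5, "R": 0.2, "M": 2}

checks = {k: partial_sign(formula, point, k) == s for k, s in EXPECTED.items()}
# Hard gate: just below the threshold the score should be exactly zero.
checks["gate"] = formula(**dict(point, E=0.5 - 1e-6)) == 0.0

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'FAILED'}")
```

A single evaluation point only gives local evidence; checking several points across the documented input ranges would be a more convincing test of monotonicity.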

CLI

usage: sota-verify [-h] [--json] [--gate GATE] formula_script

positional arguments:
  formula_script    Path to Python file with formula() and load_data()

options:
  --json            Output results as JSON
  --gate GATE       Hard gate threshold for E (default: 0.5)

License

MIT
