A 6-test verification suite for scoring formulas and evaluation metrics. Given a formula that maps empirical signals to a score, sota-verify checks whether that formula is fit for purpose — robust, well-behaved, and better than trivial baselines.
Based on the k-server-bench evaluation methodology.

| # | Test | What it checks | Pass criterion |
|---|---|---|---|
| 1 | Combined Score | Product of scores across all instances correlates with ground-truth ranking | Spearman ρ > 0.5 |
| 2 | Robustness | Gaussian noise at 4 levels doesn't flip the formula's ranking | Sign-flip rate < 5% at all noise levels |
| 3 | Gradient / Monotonicity | Partial derivatives have the correct sign; hard gates fire correctly | 100% of checks pass |
| 4 | Baseline Comparison | Formula beats random and reversed baselines on every instance | Beats both on all metrics |
| 5 | Axiom Checks | Maximum at expected point, hard gates at boundary, monotonicity | All axioms satisfied |
| 6 | Cross-Instance Transfer | Formula trained on easy instances predicts hard-instance quality | Beats raw quality in >30% of pairs |
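
To make the robustness criterion in row 2 concrete, here is a minimal sketch of a sign-flip measurement. It is not the suite's implementation: the noise levels, the choice to perturb only the continuous signals, the clamping floor, and the names `example_formula`, `perturb`, and `sign_flip_rate` are all assumptions made for illustration.

```python
import random
from itertools import combinations

# Assumed noise levels; the levels used by the real suite are not documented here.
NOISE_LEVELS = (0.01, 0.02, 0.05, 0.10)
SIGNALS = ("E", "S", "U", "C", "R", "M")


def example_formula(E, S, U, C, R, M):
    """Stand-in scoring function with a hard gate on E (illustrative only)."""
    if E < 0.5:
        return 0.0
    return (E * S * U * (C / (1 + C))) / (R * (1 + M))


def perturb(cand, sigma, rng):
    """Add Gaussian noise to the continuous signals; the integer M is left alone."""
    noisy = dict(cand)
    for key in ("E", "S", "U", "C", "R"):
        # Clamp every signal to a small positive floor, mainly so R never reaches zero.
        noisy[key] = max(1e-6, cand[key] + rng.gauss(0.0, sigma))
    return noisy


def score(fn, cand):
    return fn(*(cand[k] for k in SIGNALS))


def sign(x):
    return (x > 0) - (x < 0)


def sign_flip_rate(fn, candidates, sigma, n_perturbations=50, seed=0):
    """Fraction of (pair, trial) comparisons whose ordering changes under noise."""
    rng = random.Random(seed)
    flips = total = 0
    for a, b in combinations(candidates, 2):
        base = sign(score(fn, a) - score(fn, b))
        if base == 0:
            continue  # tied pairs have no ordering to flip
        for _ in range(n_perturbations):
            noisy_a = score(fn, perturb(a, sigma, rng))
            noisy_b = score(fn, perturb(b, sigma, rng))
            flips += sign(noisy_a - noisy_b) != base
            total += 1
    return flips / total if total else 0.0


candidates = [
    {"E": 0.9, "S": 0.8, "U": 1.5, "C": 0.6, "R": 0.1, "M": 1},
    {"E": 0.6, "S": 0.3, "U": 0.8, "C": 0.2, "R": 0.5, "M": 3},
]
rates = {s: sign_flip_rate(example_formula, candidates, s) for s in NOISE_LEVELS}
passed = all(rate < 0.05 for rate in rates.values())  # the < 5% criterion from the table
```

Counting flips per pair rather than per full ranking keeps the measure well defined for any number of candidates, at the cost of weighting every pair equally.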

The number of tests passed determines the verdict:

| Tests Passed | Verdict |
|---|---|
| 6/6 | SOTA — formula passes all verification |
| 5/6 | Near-SOTA — minor weakness |
| 4/6 | Promising — needs work |
| 3/6 | Framework — correct structure, poor performance |
| < 3/6 | Not viable — fundamental issues |
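
The verdict scale above is a simple lookup on the number of passed tests. The sketch below shows one way to apply it to a list of result dicts that each carry a "passed" field; the helper names `verdict` and `verdict_from_results` are hypothetical and not part of the package.

```python
def verdict(n_passed):
    """Map a count of passed tests (0 to 6) to the verdict label in the table."""
    if not 0 <= n_passed <= 6:
        raise ValueError(f"n_passed must be between 0 and 6, got {n_passed}")
    labels = {6: "SOTA", 5: "Near-SOTA", 4: "Promising", 3: "Framework"}
    # Anything below 3 falls through to the last row of the table.
    return labels.get(n_passed, "Not viable")


def verdict_from_results(results):
    """Count the result dicts whose "passed" field is truthy, then look up the label."""
    return verdict(sum(1 for r in results if r.get("passed")))


# Hand-written results for illustration, not real suite output.
example_results = [
    {"test": "combined_score", "passed": True},
    {"test": "robustness", "passed": True},
    {"test": "gradients", "passed": True},
    {"test": "baseline_comparison", "passed": False},
    {"test": "axioms", "passed": True},
    {"test": "cross_instance", "passed": True},
]
print(verdict_from_results(example_results))
```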

Install:

    pip install sota-verify

Define your formula and data in a Python file:

    # my_formula.py
    import numpy as np


    def formula(E, S, U, C, R, M):
        """Your scoring function.

        E=quality, S=surprise, U=coverage, C=challenge, R=risk, M=complexity.
        """
        # Hard gate: candidates below the quality threshold score zero.
        if E < 0.5:
            return 0.0
        sigma = C / (1 + C)  # saturating transform of challenge
        # Note: this example divides by R, so it requires R > 0.
        return (E * S * U * sigma) / (R * (1 + M))


    def load_data():
        """Return {"per_metric": {...}, "cross_instance": [...]}."""
        # ... load your benchmark data ...
        return data

Run the suite:

    sota-verify my_formula.py

Or use it programmatically:

    from sota_verify import run_all
    import my_formula

    results = run_all(my_formula.load_data(), my_formula.formula)

Your load_data() function must return:

    {
        "per_metric": {
            "metric_name_1": [
                {
                    "name": "candidate_a",
                    "E": 0.82,         # empirical quality [0, 1]
                    "S": 0.64,         # surprise [0, 1]
                    "U": 1.3,          # hard-case coverage [0, inf)
                    "C": 0.45,         # challenge magnitude [0, 1]
                    "R": 0.12,         # reproducibility risk [0, 1]
                    "M": 2,            # complexity (int, >= 0)
                    "B_formula": 0.0   # pre-computed formula output
                },
                # ... more candidates
            ],
            "metric_name_2": [ ... ],
        },
        "cross_instance": [
            {"rho_E": 0.45, "rho_formula": 0.62},  # correlation on a transfer pair
            # ... more pairs
        ]
    }

run_all runs all 6 tests and returns a list of result dicts, each with "test", "passed", and test-specific fields.
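
Before handing data to the suite, it can help to sanity-check the dict returned by load_data() against the keys and ranges listed above. The following is a rough sketch under the assumption that those ranges are meant as hard limits; `validate_data` and `SIGNAL_RANGES` are hypothetical names, not part of sota-verify.

```python
# Closed ranges for the continuous signals, taken from the comments in the schema.
SIGNAL_RANGES = {
    "E": (0.0, 1.0),
    "S": (0.0, 1.0),
    "U": (0.0, float("inf")),
    "C": (0.0, 1.0),
    "R": (0.0, 1.0),
}


def validate_data(data):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for top_key in ("per_metric", "cross_instance"):
        if top_key not in data:
            problems.append(f"missing top-level key {top_key!r}")

    for metric, candidates in data.get("per_metric", {}).items():
        for i, cand in enumerate(candidates):
            label = f"{metric}[{i}]"
            for key in ("name", "B_formula"):
                if key not in cand:
                    problems.append(f"{label}: missing {key!r}")
            for key, (lo, hi) in SIGNAL_RANGES.items():
                value = cand.get(key)
                if not isinstance(value, (int, float)) or not lo <= value <= hi:
                    problems.append(f"{label}: {key}={value!r} outside [{lo}, {hi}]")
            m = cand.get("M")
            if not isinstance(m, int) or m < 0:
                problems.append(f"{label}: M={m!r} is not a non-negative int")

    for i, pair in enumerate(data.get("cross_instance", [])):
        for key in ("rho_E", "rho_formula"):
            if key not in pair:
                problems.append(f"cross_instance[{i}]: missing {key!r}")
    return problems


sample = {
    "per_metric": {
        "metric_name_1": [
            {"name": "candidate_a", "E": 0.82, "S": 0.64, "U": 1.3,
             "C": 0.45, "R": 0.12, "M": 2, "B_formula": 0.0},
        ],
    },
    "cross_instance": [{"rho_E": 0.45, "rho_formula": 0.62}],
}
for problem in validate_data(sample):
    print(problem)
```

Returning a list of messages instead of raising on the first error makes it easier to fix a whole data file in one pass.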
Each test can be called independently:

    test_combined_score(data, formula_key="B_formula")
    test_robustness(data, formula_fn, formula_key="B_formula", n_perturbations=50)
    test_gradients(formula_fn, hard_gate_threshold=0.5)
    test_baseline_comparison(data, formula_key="B_formula")
    test_axioms(formula_fn, hard_gate_threshold=0.5)
    test_cross_instance(data, formula_key="rho_formula", e_key="rho_E")
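
As an illustration of the kind of check a gradient and hard-gate test might perform, the sketch below estimates the sign of each partial derivative by central differences and confirms that the score drops to zero just below the gate. It is not the body of test_gradients: the expected signs are assumed from the example formula (higher E, S, U, C should raise the score; higher R and M should lower it), and `partial_sign` and `check_gradients` are hypothetical names.

```python
# Assumed expected signs for the continuous signals, based on the example formula.
EXPECTED_SIGNS = {"E": +1, "S": +1, "U": +1, "C": +1, "R": -1}

# Evaluation point taken from the sample candidate in the data format.
POINT = {"E": 0.82, "S": 0.64, "U": 1.3, "C": 0.45, "R": 0.12, "M": 2}


def example_formula(E, S, U, C, R, M):
    """Stand-in scoring function with a hard gate on E (illustrative only)."""
    if E < 0.5:
        return 0.0
    return (E * S * U * (C / (1 + C))) / (R * (1 + M))


def partial_sign(fn, point, key, h=1e-4):
    """Sign of the central-difference estimate of d(fn)/d(key) at `point`."""
    up = dict(point, **{key: point[key] + h})
    down = dict(point, **{key: point[key] - h})
    diff = fn(**up) - fn(**down)
    return (diff > 0) - (diff < 0)


def check_gradients(fn, point, hard_gate_threshold=0.5):
    """Return {check_name: bool} for each sign check plus the hard gate."""
    checks = {
        key: partial_sign(fn, point, key) == want
        for key, want in EXPECTED_SIGNS.items()
    }
    # M is an integer, so compare M against M + 1 instead of differentiating.
    checks["M"] = fn(**dict(point, M=point["M"] + 1)) < fn(**point)
    # Hard gate: just below the threshold the score should be exactly zero.
    checks["gate"] = fn(**dict(point, E=hard_gate_threshold - 1e-3)) == 0.0
    return checks


checks = check_gradients(example_formula, POINT)
print(all(checks.values()), checks)
```

Central differences are used here because they need only function evaluations, so the same check works for formulas containing branches such as the hard gate, provided the evaluation point is not within `h` of the threshold.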

    usage: sota-verify [-h] [--json] [--gate GATE] formula_script

    positional arguments:
      formula_script  Path to Python file with formula() and load_data()

    options:
      --json          Output results as JSON
      --gate GATE     Hard gate threshold for E (default: 0.5)

License: MIT