DesignerEE/sota-verify

sota-verify

A 6-test verification suite for scoring formulas and evaluation metrics. Given a formula that maps empirical signals to a score, sota-verify checks whether that formula is fit for purpose: robust to noise, well-behaved, and better than trivial baselines.

Based on the k-server-bench evaluation methodology.

The 6 Tests

| # | Test | What it checks | Pass criterion |
|---|------|----------------|----------------|
| 1 | Combined Score | Product of scores across all instances correlates with ground-truth ranking | Spearman ρ > 0.5 |
| 2 | Robustness | Gaussian noise at 4 levels doesn't flip the formula's ranking | Sign-flip rate < 5% at all noise levels |
| 3 | Gradient / Monotonicity | Partial derivatives have the correct sign; hard gates fire correctly | 100% of checks pass |
| 4 | Baseline Comparison | Formula beats random and reversed baselines on every instance | Beats both on all metrics |
| 5 | Axiom Checks | Maximum at expected point, hard gates at boundary, monotonicity | All axioms satisfied |
| 6 | Cross-Instance Transfer | Formula trained on easy instances predicts hard-instance quality | Beats raw quality in >30% of pairs |
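Test 1's pass criterion is stated in terms of Spearman rank correlation. The sketch below shows how ρ can be computed from two score lists using only the standard library. The helper names `average_ranks` and `spearman_rho` and the sample values are illustrative assumptions, not part of the sota-verify API, and the package may compute the statistic differently.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank vectors (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical formula scores and ground-truth ranks for four candidates.
formula_scores = [0.10, 0.35, 0.20, 0.80]
ground_truth = [1, 3, 2, 4]
rho = spearman_rho(formula_scores, ground_truth)
print(rho > 0.5)  # the two orderings agree exactly, so the criterion is met
```

If either input is constant, the rank variance is zero and the division fails, so a real check would need to handle that case explicitly.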

Verdict Scale

| Tests Passed | Verdict |
|--------------|---------|
| 6/6 | SOTA: formula passes all verification |
| 5/6 | Near-SOTA: minor weakness |
| 4/6 | Promising: needs work |
| 3/6 | Framework: correct structure, poor performance |
| < 3/6 | Not viable: fundamental issues |
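The scale above is a simple lookup on the number of passed tests. A minimal sketch of that mapping follows; the function name `verdict` is hypothetical and is not claimed to exist in the package.

```python
def verdict(passed: int) -> str:
    """Map the number of passed tests (0..6) to the verdict label from the scale."""
    if not 0 <= passed <= 6:
        raise ValueError("passed must be between 0 and 6")
    labels = {
        6: "SOTA",
        5: "Near-SOTA",
        4: "Promising",
        3: "Framework",
    }
    # Anything below 3 passes falls through to the lowest tier.
    return labels.get(passed, "Not viable")

for n in (6, 4, 2):
    print(f"{n}/6 -> {verdict(n)}")
```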

Quick Start

Install:

pip install sota-verify

Define your formula and data in a Python file:

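# my_formula.py
import numpy as np

def formula(E, S, U, C, R, M):
    """Your scoring function.

    E=quality, S=surprise, U=coverage, C=challenge, R=risk, M=complexity.
    """
    # Hard gate: candidates below the quality threshold score zero.
    if E < 0.5:
        return 0.0
    # Saturating transform of challenge: maps C >= 0 into [0, 1).
    sigma = C / (1 + C)
    # Risk and complexity act as penalties; R must be > 0 to avoid division by zero.
    return (E * S * U * sigma) / (R * (1 + M))

def load_data():
    """Return {"per_metric": {...}, "cross_instance": [...]}."""
    # ... load your benchmark data ...
    return data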
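Before running the suite, it can help to sanity-check the example formula by hand. The sketch below uses finite differences to confirm which inputs raise or lower the score and that the hard gate at E < 0.5 returns zero. The base point, step sizes, and helper name `direction` are illustrative assumptions. This is not how sota-verify's gradient test is necessarily implemented.

```python
def formula(E, S, U, C, R, M):
    """The example formula from my_formula.py, repeated so this block runs alone."""
    if E < 0.5:
        return 0.0
    sigma = C / (1 + C)
    return (E * S * U * sigma) / (R * (1 + M))

# Arbitrary interior point that clears the hard gate (an assumption).
BASE = dict(E=0.82, S=0.64, U=1.3, C=0.45, R=0.12, M=2)

# Directions read off the formula: numerator terms increase the score,
# denominator terms decrease it.
EXPECTED_SIGN = {"E": 1, "S": 1, "U": 1, "C": 1, "R": -1, "M": -1}

# M is an integer count, so it is stepped by 1; the rest use a small step.
STEP = {"E": 1e-4, "S": 1e-4, "U": 1e-4, "C": 1e-4, "R": 1e-4, "M": 1}

def direction(fn, base, name):
    """Return +1, -1, or 0 for the change in fn when `name` is nudged upward."""
    nudged = {**base, name: base[name] + STEP[name]}
    diff = fn(**nudged) - fn(**base)
    return (diff > 0) - (diff < 0)

for name, expected in EXPECTED_SIGN.items():
    status = "ok" if direction(formula, BASE, name) == expected else "MISMATCH"
    print(f"d(score)/d{name}: {status}")

# Hard gate: just below the threshold the score must be exactly zero.
print(formula(**{**BASE, "E": 0.49}) == 0.0)
```

Because the formula divides by R, a candidate with R = 0 raises `ZeroDivisionError`, so input data should keep R strictly positive or the formula should guard against it.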

Run the suite:

sota-verify my_formula.py

Or use it programmatically:

from sota_verify import run_all
import my_formula

results = run_all(my_formula.load_data(), my_formula.formula)

Data Format

Your load_data() function must return:

{
    "per_metric": {
        "metric_name_1": [
            {
                "name": "candidate_a",
                "E": 0.82,       # empirical quality [0, 1]
                "S": 0.64,       # surprise [0, 1]
                "U": 1.3,        # hard-case coverage [0, inf)
                "C": 0.45,       # challenge magnitude [0, 1]
                "R": 0.12,       # reproducibility risk [0, 1]
                "M": 2,          # complexity (int, >= 0)
                "B_formula": 0.0  # pre-computed formula output
            },
            # ... more candidates
        ],
        "metric_name_2": [ ... ],
    },
    "cross_instance": [
        {"rho_E": 0.45, "rho_formula": 0.62},   # correlation on a transfer pair
        # ... more pairs
    ]
}
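A malformed data dict is an easy source of confusing test failures, so a quick structural check before calling the suite may be worthwhile. The sketch below validates the keys and ranges listed above. `validate_data` is a hypothetical helper, not part of sota-verify, and the package may enforce different or additional rules.

```python
REQUIRED = ("name", "E", "S", "U", "C", "R", "M", "B_formula")
UNIT_INTERVAL = ("E", "S", "C", "R")  # fields documented as lying in [0, 1]

def validate_data(data):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for top in ("per_metric", "cross_instance"):
        if top not in data:
            problems.append(f"missing top-level key: {top}")
    for metric, candidates in data.get("per_metric", {}).items():
        for idx, cand in enumerate(candidates):
            where = f"{metric}[{idx}]"
            missing = [k for k in REQUIRED if k not in cand]
            if missing:
                problems.append(f"{where}: missing fields {missing}")
                continue
            for key in UNIT_INTERVAL:
                if not 0 <= cand[key] <= 1:
                    problems.append(f"{where}: {key}={cand[key]} outside [0, 1]")
            if cand["U"] < 0:
                problems.append(f"{where}: U={cand['U']} is negative")
            if not isinstance(cand["M"], int) or cand["M"] < 0:
                problems.append(f"{where}: M={cand['M']!r} is not a non-negative int")
    for idx, pair in enumerate(data.get("cross_instance", [])):
        missing = [k for k in ("rho_E", "rho_formula") if k not in pair]
        if missing:
            problems.append(f"cross_instance[{idx}]: missing fields {missing}")
    return problems

# The sample candidate and transfer pair from the format description.
sample = {
    "per_metric": {
        "metric_name_1": [
            {"name": "candidate_a", "E": 0.82, "S": 0.64, "U": 1.3,
             "C": 0.45, "R": 0.12, "M": 2, "B_formula": 0.0},
        ],
    },
    "cross_instance": [{"rho_E": 0.45, "rho_formula": 0.62}],
}
print(validate_data(sample))  # prints []
```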

API Reference

run_all(data, formula_fn, hard_gate_threshold=0.5)

Run all 6 tests. Returns a list of result dicts, each with "test", "passed", and test-specific fields.
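Since each result dict carries "test" and "passed", the list can be reduced to a pass count and a list of failures. The sketch below assumes only those two documented fields. The `summarize` helper and the sample result values are invented for illustration and are not real output from `run_all`.

```python
def summarize(results):
    """Count passes and collect the names of failed tests from result dicts."""
    failed = [r["test"] for r in results if not r["passed"]]
    return {
        "passed": len(results) - len(failed),
        "total": len(results),
        "failed": failed,
    }

# Made-up results in the documented shape; the test names are placeholders.
example_results = [
    {"test": "combined_score", "passed": True},
    {"test": "robustness", "passed": False},
    {"test": "gradients", "passed": True},
]
summary = summarize(example_results)
print(f"{summary['passed']}/{summary['total']} passed; failed: {summary['failed']}")
```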

Individual Tests

Each test can be called independently:

  • test_combined_score(data, formula_key="B_formula")
  • test_robustness(data, formula_fn, formula_key="B_formula", n_perturbations=50)
  • test_gradients(formula_fn, hard_gate_threshold=0.5)
  • test_baseline_comparison(data, formula_key="B_formula")
  • test_axioms(formula_fn, hard_gate_threshold=0.5)
  • test_cross_instance(data, formula_key="rho_formula", e_key="rho_E")
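To make the robustness test's `n_perturbations` parameter concrete, the sketch below estimates a sign-flip rate for one pair of candidates: it adds Gaussian noise to the inputs and counts how often the formula's preference between the two reverses. Everything here is an assumption made for illustration, including the noise levels, the clamping rules, the candidate values, and the helper names. The actual `test_robustness` implementation may differ.

```python
import random

def formula(E, S, U, C, R, M):
    """The Quick Start example formula, repeated so this block runs alone."""
    if E < 0.5:
        return 0.0
    sigma = C / (1 + C)
    return (E * S * U * sigma) / (R * (1 + M))

def perturb(cand, noise, rng):
    """Add N(0, noise) to the continuous inputs, clamped to their documented ranges."""
    noisy = dict(cand)
    for key in ("E", "S", "C"):
        noisy[key] = min(1.0, max(0.0, cand[key] + rng.gauss(0, noise)))
    noisy["U"] = max(0.0, cand["U"] + rng.gauss(0, noise))
    # Keep R strictly positive because the formula divides by it.
    noisy["R"] = min(1.0, max(1e-6, cand["R"] + rng.gauss(0, noise)))
    return noisy

def sign(x):
    return (x > 0) - (x < 0)

def sign_flip_rate(fn, a, b, noise, n_perturbations=50, seed=0):
    """Fraction of noisy trials in which the ordering of a and b changes."""
    rng = random.Random(seed)
    baseline = sign(fn(**a) - fn(**b))
    flips = 0
    for _ in range(n_perturbations):
        noisy = sign(fn(**perturb(a, noise, rng)) - fn(**perturb(b, noise, rng)))
        flips += noisy != baseline
    return flips / n_perturbations

# Two hypothetical candidates; only the formula's six inputs are included.
cand_a = dict(E=0.82, S=0.64, U=1.3, C=0.45, R=0.12, M=2)
cand_b = dict(E=0.60, S=0.50, U=1.0, C=0.30, R=0.20, M=3)

for noise in (0.0, 0.01, 0.05, 0.1):  # assumed noise levels
    rate = sign_flip_rate(formula, cand_a, cand_b, noise)
    print(f"noise={noise}: flip rate {rate:.2f}")
```

With zero noise the inputs are unchanged, so the flip rate is 0 by construction. Higher noise levels show how close the two candidates are to swapping order.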

CLI

usage: sota-verify [-h] [--json] [--gate GATE] formula_script

positional arguments:
  formula_script    Path to Python file with formula() and load_data()

options:
  --json            Output results as JSON
  --gate GATE       Hard gate threshold for E (default: 0.5)

License

MIT
