eval-kit

Statistical confidence for LLM evaluation. Welch's t-test, Glass's delta, power analysis. Two language bindings, cross-validated against scipy.

What it is

A small library for the math you need when comparing LLM outputs:

Are these two runs actually different, or just sampled differently from the same distribution?
How many runs do you need to reliably detect a 5-point shift?
Is my baseline precise enough to trust the comparison at all?

Two bindings, same API surface:

JS (js/lib/stats.mjs) -- Node.js built-ins only, zero dependencies, ~260 lines with JSDoc.
Python (python/src/eval_kit/) -- scipy-backed, exact. Install from source.

Cross-validated against scipy on every commit (tests/cross-validation/).

Install

JavaScript (Node 20+, no install needed -- copy the single file):

curl -O https://raw.githubusercontent.com/TracineHQ/eval-kit/main/js/lib/stats.mjs

Python (install from source):

git clone https://github.com/TracineHQ/eval-kit
cd eval-kit/python && pip install -e .

Quick start

JavaScript:

import { stats, welchTTest, requiredN } from './stats.mjs';

const baseline = [82, 85, 79, 88, 81, 84, 86, 80, 83, 87];
const variant  = [75, 78, 72, 80, 74, 77, 79, 73, 76, 81];

console.log(welchTTest(baseline, variant));
// { t, df, p, diff, se, glassD, baselineStd }

console.log(requiredN(stats(baseline).std, 5));
// runs per group needed to detect a 5-point shift

Python:

from eval_kit import descriptive_stats, welch_t_test, required_n

baseline = [82, 85, 79, 88, 81, 84, 86, 80, 83, 87]
variant  = [75, 78, 72, 80, 74, 77, 79, 73, 76, 81]

print(welch_t_test(baseline, variant))
# WelchResult(t=..., df=..., p=..., diff=..., se=..., glass_d=..., baseline_std=...)

print(required_n(descriptive_stats(baseline).std, 5))
# runs per group needed to detect a 5-point shift
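For readers who want to see what sits behind the fields in that result, here is a minimal stdlib-only sketch of Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and Glass's delta. It is not the library's code: the name sketch_welch, the dict layout, and the sign convention (variant minus baseline) are assumptions made for illustration, and the p-value is left out because it needs a t-distribution CDF.

```python
import math
from statistics import mean, stdev, variance

# Sentinel value the README describes for a zero-variance baseline.
GLASS_D_SENTINEL = 99


def sketch_welch(baseline, variant):
    """Welch's t, Welch-Satterthwaite df, and Glass's delta (no p-value).

    Hypothetical helper for illustration; assumes each group has at least
    two values and that at least one group has non-zero variance.
    """
    n1, n2 = len(baseline), len(variant)
    # Per-group variance of the mean (sample variance, n - 1 denominator).
    v1 = variance(baseline) / n1
    v2 = variance(variant) / n2

    diff = mean(variant) - mean(baseline)  # sign convention is an assumption
    se = math.sqrt(v1 + v2)                # unpooled standard error
    t = diff / se
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

    # Glass's delta scales by the baseline std only, not a pooled std.
    baseline_std = stdev(baseline)
    glass_d = diff / baseline_std if baseline_std > 0 else GLASS_D_SENTINEL

    return {"t": t, "df": df, "diff": diff, "se": se,
            "glass_d": glass_d, "baseline_std": baseline_std}


if __name__ == "__main__":
    baseline = [82, 85, 79, 88, 81, 84, 86, 80, 83, 87]
    variant = [75, 78, 72, 80, 74, 77, 79, 73, 76, 81]
    print(sketch_welch(baseline, variant))
```

The two quick-start arrays happen to have equal variances and equal sizes, so the Welch-Satterthwaite df collapses to n1 + n2 - 2 = 18 here; with unequal variances it would be smaller.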

API parity

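Both bindings expose the same functions. JS uses camelCase; Python uses snake_case. Return field names follow the same convention: the Returns column below lists the JS names, and Python uses the snake_case equivalents (for example glass_d and baseline_std).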

Purpose	JS	Python	Returns
Descriptive stats + 95% CI	`stats(arr)`	`descriptive_stats(arr)`	`{n, mean, std, cv, min, max, range, se, ciLo, ciHi, ciMargin}`
Welch's t-test + Glass's delta	`welchTTest(a, b)`	`welch_t_test(a, b)`	`{t, df, p, diff, se, glassD, baselineStd}`
Required sample size (power analysis)	`requiredN(std, delta)`	`required_n(std, delta)`	`number`
Two-tailed p-value from t-statistic	`approxPValue(absT, df)`	`approx_p_value(abs_t, df)`	`number`
t-critical for 95% CI	`tCritical(df)`	`t_critical(df)`	`number`

Precision difference: Python is exact via scipy. JS approximates p-values for df < 30 using bucketed thresholds (it returns one of {0.0001, 0.01, 0.05, 0.1, 0.5}) and uses the Abramowitz & Stegun normal CDF approximation for df >= 30. Cross-validation tests bound the divergence from scipy to within a factor of 3 on bucketed values and within 1e-2 on A&S-vs-exact p-values.
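To make the df >= 30 path concrete, the sketch below computes a two-tailed p-value by treating the t statistic as standard normal and evaluating the normal CDF with Abramowitz & Stegun formula 26.2.17. Which A&S formula the JS binding actually uses is not stated here, so the choice of 26.2.17 is an assumption, as are the function names as_normal_cdf and sketch_two_tailed_p.

```python
import math

# Constants of Abramowitz & Stegun formula 26.2.17 (assumed variant).
_P = 0.2316419
_B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)


def as_normal_cdf(x):
    """Approximate the standard normal CDF using A&S 26.2.17."""
    z = abs(x)
    k = 1.0 / (1.0 + _P * z)
    # Horner evaluation of b1*k + b2*k^2 + ... + b5*k^5.
    poly = k * (_B[0] + k * (_B[1] + k * (_B[2] + k * (_B[3] + k * _B[4]))))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    upper_tail = pdf * poly  # approximates P(Z > |x|)
    return 1.0 - upper_tail if x >= 0 else upper_tail


def sketch_two_tailed_p(abs_t):
    """Two-tailed p-value treating t as standard normal (large-df shortcut)."""
    return 2.0 * (1.0 - as_normal_cdf(abs_t))


if __name__ == "__main__":
    # Compare against the exact normal tail from math.erfc.
    for t in (0.5, 1.96, 3.0):
        exact = math.erfc(t / math.sqrt(2.0))
        print(t, sketch_two_tailed_p(t), exact)
```

The polynomial itself is very accurate for the normal distribution; the divergence from scipy comes from substituting the normal for the t-distribution. At df = 30 and |t| = 2.042 the exact two-tailed p is 0.05, while the normal shortcut gives roughly 0.041.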

Notable design choices

Glass's delta, not Cohen's d. Cohen's d pools both groups' standard deviations, which assumes equal variance. That is inconsistent with choosing Welch's t-test, which exists precisely because the variances may differ. Glass's delta uses only the baseline group's standard deviation.
Power analysis includes a factor of 2 for two-sample comparison. Without it, the formula gives the one-sample requirement, which is half the per-group sample size a two-sample comparison needs.
CV uses Math.abs(mean), so a negative mean cannot flip the sign of the coefficient of variation.
Iterative min and max rather than the spread operator, so large arrays do not blow the call stack.
Sentinels over Infinity. glassD = 99 when baseline std is 0, because Infinity does not survive JSON serialization (JSON.stringify emits null for it).
The p-value approximation documents its small-sample limitations. Use a real t-distribution CDF (jStat, scipy, or the Python binding) when you need precise p-values for df < 30.
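The factor of 2 in the power analysis can be illustrated with the standard normal-approximation formula for a two-sample comparison, n per group = 2 * ((z_alpha/2 + z_power) * std / delta)^2. The sketch below assumes a two-sided alpha of 0.05 and 80% power; those defaults, the function name sketch_required_n, and the use of normal rather than t quantiles are assumptions, not the library's documented behaviour.

```python
import math
from statistics import NormalDist


def sketch_required_n(std, delta, alpha=0.05, power=0.80):
    """Runs per group needed to detect a shift of `delta` (two-sample case).

    Hypothetical helper: normal approximation, two-sided test at level
    `alpha`, target power `power`.
    """
    if std < 0 or delta <= 0:
        raise ValueError("std must be >= 0 and delta must be > 0")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.80
    one_sample = ((z_alpha + z_power) * std / delta) ** 2
    # Both groups contribute sampling noise, so the variance of the
    # difference in means doubles and so does the required n per group.
    return math.ceil(2 * one_sample)


if __name__ == "__main__":
    print(sketch_required_n(std=10, delta=5))
```

Dropping the factor of 2 in the last line gives the sample size for comparing one noisy group against a known fixed value, which is the mistake that design note guards against.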

Monorepo layout

js/
  lib/stats.mjs                JS binding (zero deps, Node 20+)
  test/stats.test.mjs          JS unit tests (node:test)
python/
  pyproject.toml               Python package manifest
  src/eval_kit/stats.py        Python binding (numpy + scipy)
  tests/test_stats.py          Python unit tests (pytest, mirrors JS 1:1)
tests/
  cross-validation/            Parity + scipy ground truth

The Python binding is the reference implementation. The JS binding is cross-validated against it. See AGENTS.md for the parity rule that governs contributions.

Tests

# JS (22 tests)
node --test js/test/stats.test.mjs

# Python (22 tests, from python/)
cd python && pip install -e ".[dev]" && pytest

# Cross-validation (51 tests, from python/, requires node in PATH)
cd python && pytest ../tests/cross-validation/ -v

All three suites are wired into CI on every push and pull request to main.

Contributing

See CONTRIBUTING.md. The parity rule: every function must exist in both bindings with mirrored tests and cross-validation coverage.

License

Apache 2.0 -- see LICENSE and NOTICE.

Author

Anthony Ledesma. I build infrastructure for safe, observable LLM development.
