msyvr/emmy

Emmy

Evaluation-invariant measurement for multi-agent AI systems.

Status: Pre-experiment. The program overview is the canonical reference. Concrete work so far: the Phase 0 estimator calibrations in phase0/, and a smoke-test of the invariance idea in demo/.

The gap

AI agents increasingly work in groups — language-model agents that call tools and coordinate with one another, reinforcement-learning agents acting in a shared environment. These collectives are already moving into real-world deployment.

We have little settled practice for measuring what such a group does collectively. The closest tools look elsewhere: single-model methods (evaluation, interpretability, AI control) inspect one model at a time and reveal little about the behavior of the group. The field is starting to publish collective metrics, but each is reported on one setup: a score tied to a particular environment, or a signal defined for a single experiment. Whether those numbers survive a change of setup is not tested, so they rarely carry from one paper to the next, or from a lab setup to a deployment. Meanwhile, the reporting conventions that downstream safety and evaluation work will inherit are taking root now, before the measurement practice supporting them is sound.

The approach

Emmy is a research program building the measurement foundations for groups of AI agents. Its approach is atypical: instead of starting from an anthropomorphized characterization of what the agents are doing — cooperating, competing, deceiving — and building a proxy metric around it, emmy starts from quantities measurable from behavior and asks what they reveal about safety and alignment.

The field is now publishing collective-behavior metrics for LLM-agent systems: fragility/antifragility, misalignment propensity, multi-agent evaluation suites, interaction-graph measures. Emmy takes that battery of published metrics and characterizes two things for each: how stable it stays across setup changes that should not move a genuine collective property (invariance), and whether it tracks a collective property that joint task performance cannot separate (construct validity). Underneath, these are one question: does the metric track something that belongs to the group itself, rather than to the particular setup it was run in? Where a metric passes both checks, two payoffs follow. Claims about coordination, robustness, and failure become comparable across papers, and an external evaluator gains a way to inspect a deployed group directly, which is the inspection layer single-model methods aren't equipped to provide.

The measurement depends only on agents' actions and observations, which is the surface a third-party evaluator actually has, so it needs no privileged access to the underlying models. Every metric is calibrated first against synthetic systems with known ground truth, and the estimator-noise floor is printed with each result, so a reading can be told apart from an estimation artifact.

This is pre-experiment work; the claims are not yet validated. It measures behavior rather than internal cognition, so what a positive result identifies is a behavioral disposition: a property of the group, not its latent intent. This is a principled limit of behavioral evaluation, and the measurement weakens further against agents optimizing to fool it. Emmy builds on metrics the field has already published and is complementary to benchmark evaluation and interpretability.

Phase 0 — estimator calibration

Before any of these metrics is run on real LLM-agent rollouts, its estimator is calibrated against a synthetic source whose value is known in closed form. This measures estimator bias and the sampling noise floor directly, at near-zero compute cost. phase0/ calibrates all three battery metrics:

  • coordination (conditional mutual information) — recovers the known value to ~0.001 bits at N=10,000, with the noise floor printed (0.025 bits);
  • fragility / antifragility (response curvature) — recovers the sign and magnitude; the floor sets the budget at which fragile-vs-antifragile is callable;
  • misalignment-propagation (contagion) — recovers the coefficient; the floor sets the budget at which faint contagion is detectable from zero.
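As an illustration of the calibration pattern described above, here is a minimal sketch for the coordination metric. It is not the code in phase0/. The synthetic source (`draw`), the plug-in estimator (`plugin_cmi_bits`), the flip probability, and the choice of a 95th-percentile null floor are all assumptions made for this example. The source is chosen so that the true conditional mutual information is known in closed form: with fair bits X and Z and Y = X xor Bernoulli(flip), I(X;Y|Z) = 1 - H(flip) bits.

```python
import math
import random
from collections import Counter


def plugin_cmi_bits(samples):
    """Plug-in estimate of I(X;Y|Z) in bits from a list of (x, y, z) triples."""
    n = len(samples)
    c_xyz = Counter(samples)
    c_xz = Counter((x, z) for x, _, z in samples)
    c_yz = Counter((y, z) for _, y, z in samples)
    c_z = Counter(z for _, _, z in samples)
    # I(X;Y|Z) = sum p(x,y,z) * log2( p(x,y,z) p(z) / (p(x,z) p(y,z)) )
    return sum(
        (k / n) * math.log2(k * c_z[z] / (c_xz[(x, z)] * c_yz[(y, z)]))
        for (x, y, z), k in c_xyz.items()
    )


def draw(n, flip, rng):
    """Hypothetical synthetic source: Z, X fair bits; Y = X xor Bernoulli(flip)."""
    out = []
    for _ in range(n):
        z = rng.randrange(2)
        x = rng.randrange(2)
        y = x ^ (rng.random() < flip)
        out.append((x, y, z))
    return out


def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


rng = random.Random(0)
N, FLIP = 10_000, 0.1  # illustrative values, not the repo's settings

truth = 1.0 - binary_entropy(FLIP)             # closed-form ground truth
estimate = plugin_cmi_bits(draw(N, FLIP, rng))

# Noise floor: with flip = 0.5, Y is independent of X, so the true value is 0
# and whatever the estimator reports is sampling noise plus bias.
null = sorted(plugin_cmi_bits(draw(N, 0.5, rng)) for _ in range(30))
floor = null[int(0.95 * len(null))]

print(f"truth={truth:.4f}  estimate={estimate:.4f}  null floor={floor:.5f} bits")
```

The numbers this prints depend on the toy source and will not match the figures quoted in the list above, which come from the repository's own calibration.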

Each result is the estimator's resolution limit: the floor against which a later finding that a metric does or does not travel across setups can be read as real rather than as an estimation artifact. The next phase runs these calibrated estimators on small LLM-agent teams (the invariance sweep). See phase0/README.md.
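To make the idea of a resolution limit concrete, the sketch below applies it to a curvature estimate. Everything in it is an assumption for illustration: the noisy quadratic `response` function, the central second-difference estimator, the two-standard-error floor, and the sign convention (convex response labeled antifragile, concave labeled fragile). It shows how a sampling budget `n` determines whether the sign is callable at all.

```python
import math
import random
import statistics


def response(s, curvature, rng, noise_sd=1.0):
    """Hypothetical noisy response to a stressor of size s, with f''(0) = curvature."""
    return 0.3 * s + 0.5 * curvature * s * s + rng.gauss(0.0, noise_sd)


def estimate_curvature(curvature, n, rng, d=1.0):
    """Central second difference of sample means at s = -d, 0, +d, plus its standard error."""
    levels = {s: [response(s, curvature, rng) for _ in range(n)] for s in (-d, 0.0, d)}
    mean = {s: statistics.fmean(v) for s, v in levels.items()}
    var = {s: statistics.variance(v) for s, v in levels.items()}
    est = (mean[d] - 2.0 * mean[0.0] + mean[-d]) / d**2
    # Variance of (a - 2b + c) for independent sample means is (va + 4vb + vc) / n.
    se = math.sqrt((var[d] + 4.0 * var[0.0] + var[-d]) / n) / d**2
    return est, se


def call_sign(est, se, z=2.0):
    """Call the sign only when the estimate clears the floor z * se."""
    floor = z * se
    if est > floor:
        return "antifragile", floor
    if est < -floor:
        return "fragile", floor
    return "not callable", floor


rng = random.Random(1)
TRUE_CURVATURE = -1.0  # concave synthetic system, known by construction

results = []
for n in (20, 4000):  # a small and a large sampling budget
    est, se = estimate_curvature(TRUE_CURVATURE, n, rng)
    label, floor = call_sign(est, se)
    results.append((n, est, floor, label))
    print(f"n={n:5d}  estimate={est:+.3f}  floor={floor:.3f}  call={label}")
```

At the small budget the floor is roughly as large as the effect, so the call is unreliable and will often come out as "not callable". At the large budget the floor sits well below the true curvature.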

Demo — invariance under reward rescaling

demo/ is a small, runnable smoke-test of the measurement machinery on the simplest case, reward rescaling, where behavioral invariance is provable because it follows from policy invariance. Two tabular Q-learning agents play the iterated prisoner's dilemma, and the demo shows that behavioral observables (coordination, action autocorrelation) are invariant under reward rescaling while reward-based quantities are not. It is a smoke-test of the machinery, not a research result. See demo/README.md.
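The invariance argument can be sketched in a few lines. This is not the code in demo/. The payoff values, hyperparameters, state encoding, and the `match_rate` stand-in for coordination are assumptions for illustration. The Q-learning update is linear in the reward, so with Q-tables initialized at zero, scaling every reward by a positive constant scales every Q-value by the same constant and leaves each greedy choice unchanged. The scale here is a power of two so that the scaling is also exact in floating point; for other positive scales the argument holds in exact arithmetic, but rounding could in principle flip a near-tie.

```python
import random

# Payoff to a player for (own action, other's action); 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): 3.0, (0, 1): 0.0, (1, 0): 5.0, (1, 1): 1.0}


def run_ipd(scale, steps=5000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Two epsilon-greedy tabular Q-learners; rewards are multiplied by `scale`."""
    rng = random.Random(seed)
    # State = previous joint action from the agent's view (0..3), or 4 at the start.
    q = [[[0.0, 0.0] for _ in range(5)] for _ in range(2)]
    state = [4, 4]
    actions, total_reward = [], 0.0
    for _ in range(steps):
        acts = []
        for i in range(2):
            if rng.random() < eps:
                acts.append(rng.randrange(2))        # explore
            else:
                row = q[i][state[i]]
                acts.append(0 if row[0] >= row[1] else 1)  # greedy, fixed tie-break
        for i in range(2):
            me, other = acts[i], acts[1 - i]
            r = scale * PAYOFF[(me, other)]
            nxt = 2 * me + other
            row = q[i][state[i]]
            row[me] += alpha * (r + gamma * max(q[i][nxt]) - row[me])
            state[i] = nxt
            total_reward += r
        actions.append(tuple(acts))
    return actions, total_reward / (2 * steps)


def match_rate(actions):
    """Fraction of rounds in which both agents chose the same action."""
    return sum(a == b for a, b in actions) / len(actions)


def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a numeric sequence (0.0 if it is constant)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((xs[t] - mean) * (xs[t + 1] - mean) for t in range(n - 1)) / var


base_actions, base_reward = run_ipd(scale=1.0)
scaled_actions, scaled_reward = run_ipd(scale=4.0)

print("same action sequence:", base_actions == scaled_actions)
print("match rate:", match_rate(base_actions), match_rate(scaled_actions))
print("lag-1 autocorr, agent 0:",
      lag1_autocorr([a for a, _ in base_actions]),
      lag1_autocorr([a for a, _ in scaled_actions]))
print("mean reward:", base_reward, scaled_reward)
```

Because both runs share a seed and the exploration draws do not depend on the Q-values, the two action sequences are identical, so any observable computed from actions alone agrees between them, while the mean reward differs by the scale factor.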

Emmy Noether

Named after Emmy Noether (1882–1935), whose foundational work connecting symmetries to invariants underlies the framing of evaluation-invariant measurement.

License

Apache 2.0 — see LICENSE.
