metatest - Test quality evaluation using property-based fault simulation
A framework for measuring REST API test quality through fault simulation. Tests are instrumented transparently via bytecode weaving; no changes to existing test code are needed. During a simulation run, HTTP responses are systematically mutated and the test suite is re-executed against each variant. Faults that the tests fail to detect are reported.
Metatest draws from two prior techniques. From mutation testing, it borrows the evaluation posture: the goal is to assess whether tests catch corruptions, not to find bugs in the implementation. From property-based testing, it borrows the specification posture: invariants are first-class definitions of what an API must always guarantee, and the mutation space is derived from them rather than from code grammar.
This is the key departure from code-level mutation testing. Property-based testing frameworks generate adversarial inputs to falsify a stated property; metatest applies the same falsification idea to API responses rather than function inputs. The invariant defines the semantic boundary (price > 0, status in [PENDING, FILLED], created_at <= updated_at), and faults are generated by constructing responses that cross it. This produces semantically meaningful mutations, not arbitrary syntactic perturbations.
In practice, you define invariants for your API endpoints in YAML files alongside your existing tests. Metatest handles instrumentation and simulation; no changes to test code or application code are required.
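For a concrete picture of what such a file might contain, here is a minimal sketch covering the invariants mentioned above. The file name and the keys (endpoint, invariants, field, rule) are assumptions made for illustration, not metatest's actual schema.

```yaml
# invariants/orders.yaml (hypothetical sketch; metatest's actual format may differ)
endpoint: "GET /orders/{id}"
invariants:
  - field: price
    rule: "> 0"                       # fault: mutate responses so price <= 0
  - field: status
    rule: "one of [PENDING, FILLED]"  # fault: substitute an out-of-range status
  - field: created_at
    rule: "<= updated_at"             # fault: reorder the timestamps
```

Each rule marks a semantic boundary; the simulator constructs responses that cross it and checks whether the existing tests notice.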
antigen - Automated API test generation validated by fault simulation
A tool for automated API test generation. Given an API specification, antigen uses an LLM to generate an initial test suite, then validates it through compilation, execution, and metatest fault simulation. Tests that pass execution but fail to detect injected faults are revised automatically until they meet a configurable detection threshold.
The goal is not just to generate tests that pass, but tests that would fail when the API misbehaves—a stricter and more useful standard.
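The revision loop can be pictured roughly as below. This is a conceptual sketch in Python under assumed names; the callables (generate, compiles, runs, detection_rate, revise) are placeholders for illustration, not antigen's actual interface.

```python
# Conceptual sketch of antigen's generate / validate / revise loop.
# The callables passed in are hypothetical placeholders, not antigen's real API.

def build_suite(api_spec, generate, compiles, runs, detection_rate, revise,
                threshold=0.8, max_rounds=5):
    suite = generate(api_spec)                # LLM drafts an initial test suite
    for _ in range(max_rounds):
        if not compiles(suite):               # must compile before anything else
            suite = revise(suite, "compilation error")
        elif not runs(suite):                 # must pass against the real API
            suite = revise(suite, "execution failure")
        else:
            rate = detection_rate(suite)      # metatest fault simulation: fraction of
            if rate >= threshold:             # injected faults the suite detects
                return suite
            suite = revise(suite, f"detection rate {rate:.2f} below threshold")
    return suite
```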
Both projects are in early development. Interfaces and configuration formats should be considered unstable.
The broader interest is in the relationship between test design and defect detection. Some of the open questions:
- How can business rules be expressed concisely enough to be maintained alongside the systems they describe?
- What fault models are appropriate at different system boundaries—and how far can they be derived from existing specifications rather than hand-authored?
- Where can automated test generation reliably produce useful tests, and where does it fall short?
These are partly engineering questions and partly empirical ones. The current tools are a starting point for exploring them in a practical context.
loreval — LLM logic and reasoning evaluation on puzzle design and solving
A benchmark and CLI for measuring language model performance on grid-based spatial puzzles. Two evaluation tasks are supported: solving
puzzles the model has not seen before, and generating new puzzles given a set of constraints. Outcomes are binary — a puzzle is either
solved or it isn't, a generated puzzle is either solvable or it isn't — so neither scoring rubrics nor LLM-as-judge evaluation is required.
Puzzles are defined in a small domain-specific language. A level specifies a grid of tiles, a legend mapping characters to tile types, and agent declarations with start positions and goals. Mechanics include locked doors that require a matching switch to open, paint tiles that change an agent's color, one-way tiles, and multi-agent configurations where agents block each other like walls. The difficulty of a puzzle is a function of how many mechanics interact and in what order they must be resolved, making it possible to construct problems that demand multi-step causal planning rather than path-finding.
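To make the structure concrete, here is a sketch of what a small level could look like; the layout keywords and legend characters are illustrative assumptions, not the actual .lrev syntax.

```
# Illustrative sketch only; not the actual .lrev syntax.
grid:
  #######
  #A.s..#
  ####D##
  #G....#
  #######
legend:
  '#' -> wall
  '.' -> floor
  's' -> switch(red)
  'D' -> door(red)     # stays locked until the matching red switch is pressed
agents:
  agent1: start = A, goal = G
```

Here the door is the only way from the upper corridor to the goal, so the switch has to be resolved first; difficulty grows as more such dependencies are chained.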
The design task is the less common of the two. Most spatial reasoning benchmarks test only the solver direction: give the model a problem, check the answer. Asking whether a model can construct a well-formed problem — one that is solvable, uses its mechanics purposefully rather than decoratively, and meets a difficulty specification — probes a different and arguably harder capability. A loop evaluation combines both directions: one model designs a puzzle, another attempts to solve it. This measures whether design capability and solution capability are correlated, within and across model families.
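One way to picture the loop evaluation is the Python sketch below, written against assumed helpers (design, validate, solve, replay) rather than the actual CLI interface.

```python
# Conceptual sketch of a loop evaluation: one model designs, another solves.
# The callables are hypothetical placeholders, not the loreval CLI's interface.

def loop_eval(design, solve, validate, replay, constraints, n=100):
    designed_ok = solved = 0
    for _ in range(n):
        puzzle = design(constraints)    # designer model emits a candidate level
        if not validate(puzzle):        # well-formed and solvable? (binary)
            continue
        designed_ok += 1
        moves = solve(puzzle)           # solver model proposes a move sequence
        if replay(puzzle, moves):       # did the moves reach the goal? (binary)
            solved += 1
    return designed_ok / n, solved / max(designed_ok, 1)
```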
A BFS-based validator checks solvability and identifies which mechanics appear in the optimal solution path versus which are placed but
never required, giving a second signal on generated puzzle quality beyond parse validity.
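In spirit, the solvability check is an ordinary breadth-first search over game states. The sketch below is a toy version, not loreval's implementation: it handles a single agent with walls and one door/switch pair, and records which of those mechanics the shortest path it finds actually uses.

```python
# Toy BFS solvability check, not loreval's implementation: single agent,
# walls, and one door/switch pair only. Tiles: '#'=wall, '.'=floor,
# 'd'=locked door, 's'=switch; any other character is treated as floor.
from collections import deque

def solvable(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    init = (start[0], start[1], False)            # state = (row, col, switch pressed)
    queue, seen = deque([(init, [])]), {init}
    while queue:
        (r, c, pressed), used = queue.popleft()
        if (r, c) == goal:
            return True, used                     # mechanics on the shortest path found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == '#':
                continue
            tile = grid[nr][nc]
            if tile == 'd' and not pressed:
                continue                          # locked door blocks until the switch is hit
            nstate = (nr, nc, pressed or tile == 's')
            if nstate not in seen:
                seen.add(nstate)
                gained = ['switch'] if tile == 's' else ['door'] if tile == 'd' else []
                queue.append((nstate, used + gained))
    return False, []

# Example: the goal sits behind the door, so both mechanics are required.
# solvable(["#####", "#..s#", "###d#", "#G..#", "#####"], (1, 1), (3, 1))
# returns (True, ['switch', 'door']).
```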
The companion web tool (loreval.ai) provides a visual level editor for designing puzzles and exporting .lrev files
for use with the CLI.