# loreval

**LLM Logic and Reasoning Evaluation** — a CLI for measuring language model performance on grid-based spatial puzzles.

Two evaluation dimensions:

- **Solving** — given a puzzle it hasn't seen before, can the model produce a valid move sequence?
- **Designing** — given a set of constraints, can the model produce a well-formed, solvable puzzle?

Outcomes are discrete and verifiable. A puzzle is either solved or it isn't. A generated puzzle is either solvable or it isn't. No scoring rubrics, no LLM-as-judge.
## Puzzle DSL

Puzzles are defined in a small text-based DSL (`.lrev` files). A level specifies a grid of tiles, a legend mapping characters to tile types, and agent declarations with start positions and goals.

```
level "Switch Puzzle" 7x3
grid = [
  W W W W W W W,
  W R S R D G W,
  W W W W W W W,
]
tile S = tiles.switch(blue)
tile D = tiles.door(blue)
tile G = tiles.goal(orange)
agent(orange) start(1,1) and reach(5,1)
```
The engine parses the DSL into a game state and executes moves according to fixed rules: walls block, doors open when a matching switch is hit, paint tiles change agent color, one-way tiles restrict direction, agents block each other like walls.
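The fixed rules above can be sketched as a single passability check. This is a hypothetical illustration, not the engine's actual API (the names `State`, `Agent`, and `is_passable` are invented, and paint/lock tiles are omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class Agent:
    color: str
    pos: tuple  # (x, y)

@dataclass
class State:
    walls: set        # positions that always block
    doors: dict       # pos -> door color
    open_colors: set  # colors whose switches have been stepped on
    one_way: dict     # pos -> the only admissible entry direction
    agents: list

    def is_passable(self, pos, entering_from):
        # Walls always block.
        if pos in self.walls:
            return False
        # Doors block until a matching switch has been activated.
        if pos in self.doors and self.doors[pos] not in self.open_colors:
            return False
        # One-way tiles admit only one entry direction.
        if pos in self.one_way and self.one_way[pos] != entering_from:
            return False
        # Agents block each other like walls.
        if any(a.pos == pos for a in self.agents):
            return False
        return True
```

Stepping on a switch would add its color to `open_colors`, which is why door passability is a function of game state rather than of the grid alone.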
For solving: the DSL is sent to a Claude model with a system prompt describing the rules. The model returns a move sequence. The engine replays it and checks whether all goal agents reached their targets.
For designing: the model is given a difficulty level, grid size, and required mechanics. It returns a puzzle in the DSL. The engine parses it, validates it with a BFS solvability check, and reports which mechanics are exercised by the solution versus which appear only decoratively.
The loop command combines both: one model designs, another solves. This directly tests the core research question — can a model solve what it designs?
## Tile reference

| DSL | Behavior |
|---|---|
| `W` | Wall — always impassable |
| `R` | Floor — always passable |
| `.` | Void — out of bounds |
| `tiles.goal(color)` | Goal — agent wins by stepping on it (color must match) |
| `tiles.commonGoal()` | Universal goal — any designated agent may claim it |
| `tiles.door(color)` | Blocked until a matching switch is activated |
| `tiles.switch(color)` | Opens all doors of matching color when stepped on |
| `tiles.paint(color)` | Changes the stepping agent's color |
| `tiles.one-way(dir)` | Passable only when entering from `dir` (up/down/left/right) |
| `tiles.lock(color)` | Blocked until a matching-color agent steps on it |
## Installation

```bash
git clone https://github.com/your-org/loreval
cd loreval
pip install -e .
```

API key — copy `.env.example` to `.env` and add your Anthropic key:

```bash
cp .env.example .env
# paste ANTHROPIC_API_KEY=sk-ant-... into .env
```

Or set it as an environment variable directly:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
```

## loreval validate

Parses the DSL and runs a BFS solvability check. Reports which mechanics are used in the solution vs. those that are merely decorative.

```bash
loreval validate puzzle.lrev
```

```
✓ Parsed Switch Puzzle (7×3)
  Agents with goal    1
  Helper agents       0
  Mechanics present   door, goal, switch
Checking solvability…
✓ Solvable (4 moves)
  Mechanics used in solution: door, goal, switch
```
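At its core the solvability check is a breadth-first search. The sketch below is a deliberate simplification that searches over positions only, ignoring switch/door/paint state (the real validator tracks full game state and, per the project layout, caps the search at 200k states); function and parameter names here are hypothetical:

```python
from collections import deque

def bfs_shortest(start, goal, passable, max_states=200_000):
    # Breadth-first search over positions: the first time the goal is
    # dequeued, the depth is the minimal move count. Returns the move
    # count, -1 if the search exhausts (unsolvable), or None if the
    # state cap is hit first (solvability unknown).
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        if len(seen) > max_states:
            return None
        pos, depth = frontier.popleft()
        if pos == goal:
            return depth
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in seen and passable(nxt):
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return -1
```

The three-way return value mirrors the `is_solvable` column reported by `loreval eval` (`True`, `False`, or `None` when the state cap is hit).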
## loreval solve

Sends the DSL to the model, replays the returned move sequence, and reports pass/fail with the move count and the first failure step.

```bash
loreval solve puzzle.lrev
loreval solve puzzle.lrev --model claude-haiku-4-5-20251001 --runs 5
```

```
Switch Puzzle (7×3)  model=claude-sonnet-4-6  runs=3
Run 1: ✓ pass (4 moves)
Run 2: ✓ pass (4 moves)
Run 3: ✗ fail (6 moves)  first failure at move 3
Pass rate: 2/3
```
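Replaying a move sequence and pinpointing the first failure can be sketched like this (hypothetical names, single agent, and a bare `passable` predicate standing in for the full engine):

```python
DELTAS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def replay(moves, start, goal, passable):
    # Walk the returned sequence move by move. A blocked move is the
    # first failure; an empty response fails immediately; a sequence
    # that ends off-goal fails at its final step.
    if not moves:
        return "fail", 0
    pos = start
    for step, move in enumerate(moves, start=1):
        dx, dy = DELTAS[move]
        nxt = (pos[0] + dx, pos[1] + dy)
        if not passable(nxt):
            return "fail", step   # blocked: record first failure step
        pos = nxt
    return ("pass", None) if pos == goal else ("fail", len(moves))
```

The failure cases here correspond roughly to the `blocked`, `empty`, and `wrong_position` reasons that `loreval eval` writes to its CSV.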
Options:

| Flag | Default | Description |
|---|---|---|
| `--model, -m` | `claude-sonnet-4-6` | Model to use |
| `--runs, -r` | `1` | Number of attempts |
## loreval design

The model generates a DSL level given a difficulty, size, and required mechanics. The output is parsed and validated automatically.

```bash
loreval design --difficulty medium --size 8x6 --mechanics switch,door
loreval design --difficulty hard --size 10x10 --mechanics paint,lock,multi-agent --out puzzle.lrev
loreval design --prompt "a puzzle where two agents must coordinate" --size 8x8 --out collab.lrev
```

```
✓ Parsed Corridor Lockdown (8×6)
Checking solvability…
✓ Solvable (18 moves)
  Mechanics used: door, goal, switch
Design notes:
  The orange agent must first activate the blue switch at (3,2) before it
  can pass through the door at (5,2). The indirect route forces planning
  ahead rather than a direct path to the goal.
✓ Saved → puzzle.lrev
```
Options:

| Flag | Default | Description |
|---|---|---|
| `--model, -m` | `claude-sonnet-4-6` | Model to use |
| `--difficulty, -d` | `medium` | `easy`, `medium`, or `hard` |
| `--size, -s` | `8x8` | Grid size as WxH |
| `--mechanics` | (none) | Comma-separated: switches, doors, paint, one-way, locks, multi-agent |
| `--prompt, -p` | (none) | Free-form design brief |
| `--out, -o` | (print to stdout) | Save output to a `.lrev` file |
## loreval loop

One model designs a puzzle; another model tries to solve it. The most direct test of the core research question.

```bash
loreval loop
loreval loop --designer claude-opus-4-6 --solver claude-haiku-4-5-20251001 --difficulty hard --runs 5
loreval loop --designer claude-sonnet-4-6 --solver claude-sonnet-4-6 --out-dir results/
```

```
Loop evaluation
  Designer:  claude-opus-4-6
  Solver:    claude-haiku-4-5-20251001
  Puzzle:    hard 10×10 mechanics=any
  Runs:      5
✓ Designed: The Relay (10×10)
  Validated solvable (31 moves)
Solver runs (5):
  Run 1: ✓ pass (31 moves)
  Run 2: ✗ fail (28 moves)  first failure at move 14
  Run 3: ✗ fail (35 moves)  first failure at move 14
  Run 4: ✓ pass (31 moves)
  Run 5: ✗ fail (30 moves)  first failure at move 14
Result: 2/5 solved
  Design solvable: yes
```
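The loop's control flow is simple: design once, validate once, then give the solver its attempts. A minimal sketch, with hypothetical callables standing in for the model calls and engine checks:

```python
def run_loop(design_fn, validate_fn, solve_fn, runs=3):
    # design_fn()   -> DSL text from the designer model
    # validate_fn() -> (is_solvable, optimal_move_count) via the BFS check
    # solve_fn()    -> True if one solver attempt passes replay
    puzzle = design_fn()
    solvable, optimal_moves = validate_fn(puzzle)
    if not solvable:
        # An unsolvable design short-circuits: nothing to solve.
        return {"design_solvable": False, "passed": 0, "runs": runs}
    passed = sum(1 for _ in range(runs) if solve_fn(puzzle))
    return {"design_solvable": True, "passed": passed, "runs": runs}
```

Validating before solving is what lets the report distinguish "the designer produced a broken puzzle" from "the solver failed a fair puzzle".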
Options:

| Flag | Default | Description |
|---|---|---|
| `--designer` | `claude-sonnet-4-6` | Model that designs the puzzle |
| `--solver` | `claude-sonnet-4-6` | Model that solves the puzzle |
| `--difficulty` | `medium` | `easy`, `medium`, or `hard` |
| `--size` | `8x8` | Grid size |
| `--mechanics` | (none) | Required mechanics |
| `--runs` | `3` | Solve attempts per designed puzzle |
| `--out-dir` | (none) | Save designed puzzles to a directory |
## loreval eval

Runs solve or design evaluations over a directory of `.lrev` files and writes results to CSV.

```bash
# solve: run all .lrev files in benchmark/
loreval eval benchmark/ --model claude-haiku-4-5-20251001 --runs 3 --out results.csv

# design: generate N puzzles and validate each
loreval eval benchmark/ --task design --model claude-sonnet-4-6 --runs 10 --out design-results.csv
```

Output columns (solve):
| Column | Description |
|---|---|
| `file` | Puzzle filename |
| `model` | Model used |
| `run` | Attempt number |
| `outcome` | `pass`, `fail`, or `error` |
| `moves` | Total moves in the model's response |
| `first_failure_step` | Index of the first blocked or wrong-position move |
| `failure_reason` | `wrong_position`, `blocked`, `incomplete`, or `empty` |
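Because the CSV is one row per attempt, per-model pass rates are a small aggregation away. A sketch using only the columns above (the function name is hypothetical, not part of the CLI):

```python
import csv
from collections import defaultdict

def pass_rates(csv_path):
    # Aggregate solve results into a (file, model) -> pass-rate map,
    # counting an attempt as passed when its outcome column is "pass".
    totals = defaultdict(lambda: [0, 0])  # (file, model) -> [passes, attempts]
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            key = (row["file"], row["model"])
            totals[key][1] += 1
            totals[key][0] += row["outcome"] == "pass"
    return {key: passes / attempts for key, (passes, attempts) in totals.items()}
```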
Output columns (design):

| Column | Description |
|---|---|
| `parse_success` | Whether the DSL parsed without errors |
| `is_solvable` | `True`, `False`, or `None` (state cap hit) |
| `mechanics_required` | Pipe-separated mechanics used in the solution |
| `mechanics_decorative` | Pipe-separated mechanics not needed to solve |
| `solution_length` | BFS solution move count |
Options:

| Flag | Default | Description |
|---|---|---|
| `--model, -m` | `claude-sonnet-4-6` | Model to use |
| `--task, -t` | `solve` | `solve` or `design` |
| `--runs, -r` | `1` | Attempts per file (solve) or puzzles to generate (design) |
| `--out, -o` | `results.csv` | Output CSV path |
## Models

| ID | Name |
|---|---|
| `claude-haiku-4-5-20251001` | Claude Haiku 4.5 |
| `claude-sonnet-4-6` | Claude Sonnet 4.6 (default) |
| `claude-opus-4-6` | Claude Opus 4.6 |
## Project layout

```
loreval/
  engine/
    models.py      dataclasses: Level, Tile, Entity, GameState
    parser.py      .lrev text → Level
    game.py        move execution, passability, win condition
    validator.py   BFS solvability check (cap: 200k states)
  ai/
    prompts.py     solve + design system prompts, message builders, output parsers
    client.py      Anthropic SDK wrapper
  cli/
    validate.py    loreval validate
    solve.py       loreval solve
    design.py      loreval design
    loop.py        loreval loop
    eval.py        loreval eval
```