# loreval

**LLM Logic and Reasoning Evaluation** — a CLI for measuring language model performance on grid-based spatial puzzles.

Two evaluation dimensions:

- **Solving** — given a puzzle it hasn't seen before, can the model produce a valid move sequence?
- **Designing** — given a set of constraints, can the model produce a well-formed, solvable puzzle?

Outcomes are discrete and verifiable. A puzzle is either solved or it isn't. A generated puzzle is either solvable or it isn't. No scoring rubrics, no LLM-as-judge.
## Puzzle DSL

Puzzles are defined in a small text-based DSL (`.lrev` files). A level specifies a grid of tiles, a legend mapping characters to tile types, and agent declarations with start positions and goals.

```
level "Switch Puzzle" 7x3
grid = [
  W W W W W W W,
  W R S R D G W,
  W W W W W W W,
]
tile S = tiles.switch(blue)
tile D = tiles.door(blue)
tile G = tiles.goal(orange)
agent(orange) start(1,1) and reach(5,1)
```
The engine parses the DSL into a game state and executes moves according to fixed rules: walls block, doors open when a matching switch is hit, paint tiles change agent color, one-way tiles restrict direction, agents block each other like walls.
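The fixed rules above can be sketched as a single passability check. This is a hypothetical illustration, not the engine's actual API (the names `State`, `Agent`, and `is_passable` are invented, and paint/lock tiles are omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class Agent:
    color: str
    pos: tuple  # (x, y)

@dataclass
class State:
    walls: set        # positions that always block
    doors: dict       # pos -> door color
    open_colors: set  # colors whose switches have been stepped on
    one_way: dict     # pos -> the only admissible entry direction
    agents: list

    def is_passable(self, pos, entering_from):
        # Walls always block.
        if pos in self.walls:
            return False
        # Doors block until a matching switch has been activated.
        if pos in self.doors and self.doors[pos] not in self.open_colors:
            return False
        # One-way tiles admit only one entry direction.
        if pos in self.one_way and self.one_way[pos] != entering_from:
            return False
        # Agents block each other like walls.
        if any(a.pos == pos for a in self.agents):
            return False
        return True
```

Stepping on a switch would add its color to `open_colors`, which is why door passability is a function of game state rather than of the grid alone.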
For solving: the DSL is sent to a Claude model with a system prompt describing the rules. The model returns a move sequence. The engine replays it and checks whether all goal agents reached their targets.
For designing: the model is given a difficulty level, grid size, and required mechanics. It returns a puzzle in the DSL. The engine parses it, validates it with a BFS solvability check, and reports which mechanics are exercised by the solution versus which appear only decoratively.
The loop command combines both: one model designs, another solves. This directly tests the core research question — can a model solve what it designs?
## Tile reference

| DSL | Behavior |
|---|---|
| `W` | Wall — always impassable |
| `R` | Floor — always passable |
| `.` | Void — out of bounds |
| `tiles.goal(color)` | Goal — agent wins by stepping on it (color must match) |
| `tiles.commonGoal()` | Universal goal — any designated agent may claim it |
| `tiles.door(color)` | Blocked until a matching switch is activated |
| `tiles.switch(color)` | Opens all doors of matching color when stepped on |
| `tiles.paint(color)` | Changes the stepping agent's color |
| `tiles.one-way(dir)` | Passable only when entering from `dir` (up/down/left/right) |
| `tiles.lock(color)` | Blocked until a matching-color agent steps on it |
## Installation

```bash
git clone https://github.com/your-org/loreval
cd loreval
pip install -e .
```

API key — copy `.env.example` to `.env` and add your Anthropic key:

```bash
cp .env.example .env
# paste ANTHROPIC_API_KEY=sk-ant-... into .env
```

Or set it as an environment variable directly:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
```

## loreval validate

Parses the DSL and runs a BFS solvability check. Reports which mechanics are used in the solution vs. those that are merely decorative.

```bash
loreval validate puzzle.lrev
```

```
✓ Parsed Switch Puzzle (7×3)
  Agents with goal    1
  Helper agents       0
  Mechanics present   door, goal, switch
Checking solvability…
✓ Solvable (4 moves)
  Mechanics used in solution: door, goal, switch
```
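At its core the solvability check is a breadth-first search. The sketch below is a deliberate simplification that searches over positions only, ignoring switch/door/paint state (the real validator tracks full game state and, per the project layout, caps the search at 200k states); function and parameter names here are hypothetical:

```python
from collections import deque

def bfs_shortest(start, goal, passable, max_states=200_000):
    # Breadth-first search over positions: the first time the goal is
    # dequeued, the depth is the minimal move count. Returns the move
    # count, -1 if the search exhausts (unsolvable), or None if the
    # state cap is hit first (solvability unknown).
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        if len(seen) > max_states:
            return None
        pos, depth = frontier.popleft()
        if pos == goal:
            return depth
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in seen and passable(nxt):
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return -1
```

The three-way return value mirrors the `is_solvable` column reported by `loreval eval` (`True`, `False`, or `None` when the state cap is hit).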
## loreval solve

Sends the DSL to the model, replays the returned move sequence, and reports pass/fail with the move count and the first failure step.

```bash
loreval solve puzzle.lrev
loreval solve puzzle.lrev --model claude-haiku-4-5-20251001 --runs 5
```

```
Switch Puzzle (7×3)  model=claude-sonnet-4-6  runs=3
Run 1: ✓ pass (4 moves)
Run 2: ✓ pass (4 moves)
Run 3: ✗ fail (6 moves)  first failure at move 3
Pass rate: 2/3
```
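Replaying a move sequence and pinpointing the first failure can be sketched like this (hypothetical names, single agent, and a bare `passable` predicate standing in for the full engine):

```python
DELTAS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def replay(moves, start, goal, passable):
    # Walk the returned sequence move by move. A blocked move is the
    # first failure; an empty response fails immediately; a sequence
    # that ends off-goal fails at its final step.
    if not moves:
        return "fail", 0
    pos = start
    for step, move in enumerate(moves, start=1):
        dx, dy = DELTAS[move]
        nxt = (pos[0] + dx, pos[1] + dy)
        if not passable(nxt):
            return "fail", step   # blocked: record first failure step
        pos = nxt
    return ("pass", None) if pos == goal else ("fail", len(moves))
```

The failure cases here correspond roughly to the `blocked`, `empty`, and `wrong_position` reasons that `loreval eval` writes to its CSV.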
Options:

| Flag | Default | Description |
|---|---|---|
| `--model, -m` | `claude-sonnet-4-6` | Model to use |
| `--runs, -r` | `1` | Number of attempts |
## loreval design

The model generates a DSL level given a difficulty, size, and required mechanics. The output is parsed and validated automatically.

```bash
loreval design --difficulty medium --size 8x6 --mechanics switch,door
loreval design --difficulty hard --size 10x10 --mechanics paint,lock,multi-agent --out puzzle.lrev
loreval design --prompt "a puzzle where two agents must coordinate" --size 8x8 --out collab.lrev
```

```
✓ Parsed Corridor Lockdown (8×6)
Checking solvability…
✓ Solvable (18 moves)
  Mechanics used: door, goal, switch
Design notes:
  The orange agent must first activate the blue switch at (3,2) before it
  can pass through the door at (5,2). The indirect route forces planning
  ahead rather than a direct path to the goal.
✓ Saved → puzzle.lrev
```
Options:

| Flag | Default | Description |
|---|---|---|
| `--model, -m` | `claude-sonnet-4-6` | Model to use |
| `--difficulty, -d` | `medium` | `easy`, `medium`, or `hard` |
| `--size, -s` | `8x8` | Grid size as WxH |
| `--mechanics` | (none) | Comma-separated: switches, doors, paint, one-way, locks, multi-agent |
| `--prompt, -p` | (none) | Free-form design brief |
| `--out, -o` | (print to stdout) | Save output to a `.lrev` file |
## loreval loop

One model designs a puzzle; another model tries to solve it. The most direct test of the core research question.

```bash
loreval loop
loreval loop --designer claude-opus-4-6 --solver claude-haiku-4-5-20251001 --difficulty hard --runs 5
loreval loop --designer claude-sonnet-4-6 --solver claude-sonnet-4-6 --out-dir results/
```

```
Loop evaluation
  Designer:  claude-opus-4-6
  Solver:    claude-haiku-4-5-20251001
  Puzzle:    hard 10×10 mechanics=any
  Runs:      5
✓ Designed: The Relay (10×10)
  Validated solvable (31 moves)
Solver runs (5):
  Run 1: ✓ pass (31 moves)
  Run 2: ✗ fail (28 moves)  first failure at move 14
  Run 3: ✗ fail (35 moves)  first failure at move 14
  Run 4: ✓ pass (31 moves)
  Run 5: ✗ fail (30 moves)  first failure at move 14
Result: 2/5 solved
  Design solvable: yes
```
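The loop's control flow is simple: design once, validate once, then give the solver its attempts. A minimal sketch, with hypothetical callables standing in for the model calls and engine checks:

```python
def run_loop(design_fn, validate_fn, solve_fn, runs=3):
    # design_fn()   -> DSL text from the designer model
    # validate_fn() -> (is_solvable, optimal_move_count) via the BFS check
    # solve_fn()    -> True if one solver attempt passes replay
    puzzle = design_fn()
    solvable, optimal_moves = validate_fn(puzzle)
    if not solvable:
        # An unsolvable design short-circuits: nothing to solve.
        return {"design_solvable": False, "passed": 0, "runs": runs}
    passed = sum(1 for _ in range(runs) if solve_fn(puzzle))
    return {"design_solvable": True, "passed": passed, "runs": runs}
```

Validating before solving is what lets the report distinguish "the designer produced a broken puzzle" from "the solver failed a fair puzzle".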
Options:

| Flag | Default | Description |
|---|---|---|
| `--designer` | `claude-sonnet-4-6` | Model that designs the puzzle |
| `--solver` | `claude-sonnet-4-6` | Model that solves the puzzle |
| `--difficulty` | `medium` | `easy`, `medium`, or `hard` |
| `--size` | `8x8` | Grid size |
| `--mechanics` | (none) | Required mechanics |
| `--runs` | `3` | Solve attempts per designed puzzle |
| `--out-dir` | (none) | Save designed puzzles to a directory |
## loreval eval

Runs solve or design evaluations over a directory of `.lrev` files and writes results to CSV.

```bash
# solve: run all .lrev files in benchmark/
loreval eval benchmark/ --model claude-haiku-4-5-20251001 --runs 3 --out results.csv

# design: generate N puzzles and validate each
loreval eval benchmark/ --task design --model claude-sonnet-4-6 --runs 10 --out design-results.csv
```

Output columns (solve):
| Column | Description |
|---|---|
| `file` | Puzzle filename |
| `model` | Model used |
| `run` | Attempt number |
| `outcome` | `pass`, `fail`, or `error` |
| `moves` | Total moves in the model's response |
| `first_failure_step` | Index of the first blocked or wrong-position move |
| `failure_reason` | `wrong_position`, `blocked`, `incomplete`, or `empty` |
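Because the CSV is one row per attempt, per-model pass rates are a small aggregation away. A sketch using only the columns above (the function name is hypothetical, not part of the CLI):

```python
import csv
from collections import defaultdict

def pass_rates(csv_path):
    # Aggregate solve results into a (file, model) -> pass-rate map,
    # counting an attempt as passed when its outcome column is "pass".
    totals = defaultdict(lambda: [0, 0])  # (file, model) -> [passes, attempts]
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            key = (row["file"], row["model"])
            totals[key][1] += 1
            totals[key][0] += row["outcome"] == "pass"
    return {key: passes / attempts for key, (passes, attempts) in totals.items()}
```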
Output columns (design):

| Column | Description |
|---|---|
| `parse_success` | Whether the DSL parsed without errors |
| `is_solvable` | `True`, `False`, or `None` (state cap hit) |
| `mechanics_required` | Pipe-separated mechanics used in the solution |
| `mechanics_decorative` | Pipe-separated mechanics not needed to solve |
| `solution_length` | BFS solution move count |
Options:

| Flag | Default | Description |
|---|---|---|
| `--model, -m` | `claude-sonnet-4-6` | Model to use |
| `--task, -t` | `solve` | `solve` or `design` |
| `--runs, -r` | `1` | Attempts per file (solve) or puzzles to generate (design) |
| `--out, -o` | `results.csv` | Output CSV path |
## Models

| ID | Name |
|---|---|
| `claude-haiku-4-5-20251001` | Claude Haiku 4.5 |
| `claude-sonnet-4-6` | Claude Sonnet 4.6 (default) |
| `claude-opus-4-6` | Claude Opus 4.6 |
## Project layout

```
loreval/
  engine/
    models.py      dataclasses: Level, Tile, Entity, GameState
    parser.py      .lrev text → Level
    game.py        move execution, passability, win condition
    validator.py   BFS solvability check (cap: 200k states)
  ai/
    prompts.py     solve + design system prompts, message builders, output parsers
    client.py      Anthropic SDK wrapper
  cli/
    validate.py    loreval validate
    solve.py       loreval solve
    design.py      loreval design
    loop.py        loreval loop
    eval.py        loreval eval
```