Quilltale

title: ᚛ Quilltale ᚜
emoji: 🪶
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.14.0
app_file: app.py

Quilltale

Quilltale is a text adventure powered by an AI Game Master that maintains a persistent, structured model of the game world. The story is generated by a language model. The facts are enforced by code.

Most AI games let the model hallucinate freely. It forgets what happened a few turns ago, invents items that were never there, and contradicts itself without noticing. Quilltale works differently. Every location, item, NPC, and player action is tracked in a formal world model that the AI reads before every response and cannot override. If you leave the dagger on the table, it stays there. If you threaten someone, they remember. If you ask to pick up something that is not in your location, the world rejects it, and the story reflects that reality.

The result is a game that actually holds together over a long session, which turns out to be a significantly harder problem than generating a good story.

Live Demo

Playing

Type what you want to do. The game understands natural actions:

look around
pick up the dagger
talk to Marta
ask Marta about the chest upstairs
go north
go to the market
use the key on the chest

You start in The Broken Flagon tavern with an old iron key in your pocket that you do not remember acquiring. There is a wanted notice on the wall, a dagger on the table, and a barkeep who is watching you more carefully than she should be. Where you go from there is up to you.

A few things worth knowing. You can only pick up items that are actually present in your current location. You can only move through exits that exist in the world, though natural destination phrasing works — "go to the market" is understood. NPCs remember every significant interaction across the entire session and their behaviour changes accordingly. The World Snapshot panel shows exactly what the game currently knows to be true.

The world

         [ Room 21 ]
              |
         [ Tavern ]   you start here
              |
          [ Street ]
           /       \
      [ Alley ]  [ Market ]

Five locations. Three characters. A mystery that connects them. The scratched message in the alley, the locked chest upstairs, the wanted notice on the wall, and the hooded figure who will not speak freely are all there for a reason.

How it is built

The interesting part of this project is not the story generation, which is relatively straightforward given a capable language model. It is keeping the world consistent across an unbounded number of turns.

World state model. Every fact about the world lives in a structured Python dataclass, not in the language model's context window. Locations know their exits, their items, and who is present. Items know where they are. NPCs track their disposition and their history with the player.
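A minimal sketch of what such a model could look like. Only the class names come from the project layout; every field name, the "player" location convention, and the exit direction label are assumptions made for illustration, and the real definitions in src/world/state.py may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    location: str  # a location id, or "player" when carried (assumed convention)


@dataclass
class NPC:
    name: str
    location: str
    disposition: str = "neutral"
    memories: list = field(default_factory=list)  # history with the player


@dataclass
class Location:
    name: str
    exits: dict = field(default_factory=dict)  # direction -> location id


@dataclass
class WorldState:
    player_location: str
    locations: dict = field(default_factory=dict)
    items: dict = field(default_factory=dict)
    npcs: dict = field(default_factory=dict)

    def items_here(self):
        """Names of items lying in the player's current location."""
        return [i.name for i in self.items.values()
                if i.location == self.player_location]


world = WorldState(
    player_location="tavern",
    # The direction label "south" is assumed, not taken from the world file.
    locations={"tavern": Location("Tavern", exits={"south": "street"})},
    items={"dagger": Item("dagger", "tavern"), "key": Item("iron key", "player")},
    npcs={"marta": NPC("Marta", "tavern")},
)
print(world.items_here())
```

Because the item records hold their own location, "what can the player pick up here" is a lookup against stored facts rather than something the language model is asked to recall.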

Validated transitions. When the AI decides something changed, that change is written as a structured JSON update and validated against the world model before being applied. The model cannot pick up an item that is not in the current location. It cannot move the player through a door that does not exist. Invalid transitions are caught, logged, and reflected in the narration. Multi-step movement is supported and validated step by step; routing through unvisited locations is blocked at both the code and prompt level.
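The validation step could be sketched roughly as follows. The update shape (the "pick_up" and "move_path" keys), the state dictionary, and the function name are hypothetical; only the map adjacency is taken from the world diagram above.

```python
import json

# Adjacency taken from the world map; the ids themselves are assumed.
EXITS = {
    "room_21": {"tavern"},
    "tavern": {"room_21", "street"},
    "street": {"tavern", "alley", "market"},
    "alley": {"street"},
    "market": {"street"},
}


def apply_update(state, raw_update):
    """Validate a JSON update from the GM, apply the valid parts,
    and return a list describing whatever was rejected."""
    update = json.loads(raw_update)
    rejected = []

    item = update.get("pick_up")
    if item is not None:
        # An item can only be taken if it lies where the player stands.
        if state["items"].get(item) == state["player"]:
            state["items"][item] = "inventory"
        else:
            rejected.append(f"pick_up:{item}")

    # Multi-step movement is checked one hop at a time and stops
    # at the first hop that is not a real exit.
    for dest in update.get("move_path", []):
        if dest in EXITS[state["player"]]:
            state["player"] = dest
        else:
            rejected.append(f"move:{dest}")
            break
    return rejected


state = {"player": "tavern", "items": {"dagger": "inventory"}}
rejected = apply_update(
    state, '{"pick_up": "dagger", "move_path": ["street", "room_21"]}'
)
print(state["player"], rejected)
```

In this example the dagger is already carried, so the pickup is rejected, and the player stops at the street because Room 21 is not reachable from there. The rejection list is what would be fed back into the narration.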

Per-NPC episodic memory. Each NPC maintains a log of interactions with the player. Every entry has a turn number, a description, an emotional tone, and a significance score between one and three. Before each response, the most significant memories for NPCs in the current location are injected into the model's context. The barkeep can remember a threat from fifteen turns ago and respond accordingly because the system retrieves stored memories and carries them forward, so the model itself does not have to remember anything.
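A rough sketch of the retrieval step described above. The four fields mirror the description of a memory entry; the function names, the recency tie-break, and the rendered text format are assumptions.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    turn: int
    description: str
    tone: str
    significance: int  # 1 (minor) to 3 (major)


def top_memories(log, limit=3):
    """Most significant memories first; ties broken by recency (assumed rule)."""
    ranked = sorted(log, key=lambda m: (m.significance, m.turn), reverse=True)
    return ranked[:limit]


def memory_block(npc_name, log, limit=3):
    """Render the selected memories as text to place in the model's context."""
    lines = [f"- turn {m.turn} ({m.tone}): {m.description}"
             for m in top_memories(log, limit)]
    return f"{npc_name} remembers:\n" + "\n".join(lines)


# Illustrative entries, not taken from a real session.
marta_log = [
    MemoryEntry(2, "Player ordered an ale", "neutral", 1),
    MemoryEntry(4, "Player threatened her over the chest", "hostile", 3),
    MemoryEntry(9, "Player asked about the wanted notice", "wary", 2),
]
print(memory_block("Marta", marta_log, limit=2))
```

Ranking by significance rather than recency alone is what lets an old but important event, such as a threat, outrank several recent trivial ones.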

Grounded generation. The language model is responsible for the prose. The world model is responsible for the facts. These two concerns are separated. When the model writes something that contradicts the world state, the contradiction is caught before it reaches the player.

Evaluation

Quilltale includes an automated evaluation framework that plays through a fixed 20-turn scenario and measures three metrics using an LLM-as-judge approach.

python eval_runner.py
python eval_runner.py --no-judge   # transition rate only, no extra LLM calls

Results across two independent runs on the default world:

Metric	Run 1	Run 2
Factual consistency rate	85.0%	85.0%
Memory utilisation rate	84.6%	76.9%
Invalid transition rate	11.1%	12.5%

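Factual consistency is identical across the two runs and the other two metrics differ by under eight percentage points, which suggests reasonably stable behaviour.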

Metric interpretation. Factual consistency measures whether narration contradicts recorded world state. Memory utilisation measures whether NPCs with stored memories actually reflect those memories in their behaviour. Invalid transition rate measures how often the GM proposes a state change the world model rejects.
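Each metric is a simple ratio over the turns where it applies. The sketch below assumes a per-turn record with hypothetical keys ("consistent", "memory_used", "transition_valid"), where None means the metric does not apply to that turn. The records are made up for illustration and are not the project's evaluation data.

```python
def rate(hits, total):
    """Percentage rounded to one decimal; 0.0 when nothing applies."""
    return round(100 * hits / total, 1) if total else 0.0


def summarise(turns):
    """Compute the three rates from per-turn verdicts (hypothetical schema)."""
    judged = [t for t in turns if t.get("consistent") is not None]
    with_memory = [t for t in turns if t.get("memory_used") is not None]
    proposed = [t for t in turns if t.get("transition_valid") is not None]
    return {
        "factual_consistency": rate(
            sum(t["consistent"] for t in judged), len(judged)),
        "memory_utilisation": rate(
            sum(t["memory_used"] for t in with_memory), len(with_memory)),
        "invalid_transition": rate(
            sum(not t["transition_valid"] for t in proposed), len(proposed)),
    }


# Illustrative records only.
turns = [
    {"consistent": True, "memory_used": True, "transition_valid": True},
    {"consistent": True, "memory_used": None, "transition_valid": False},
    {"consistent": False, "memory_used": True, "transition_valid": None},
    {"consistent": True, "memory_used": False, "transition_valid": True},
]
print(summarise(turns))
```

Note that each rate has its own denominator: memory utilisation only counts turns where an NPC with stored memories is involved, and the invalid transition rate only counts turns where a state change was proposed.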

Failure analysis. Three failure patterns appear consistently across both runs.

The most interesting is what might be called plausible invention: the model consistently gives Marta an iron club she does not formally possess in the world state. The fabrication is narratively coherent (a hostile barkeep would reasonably have a weapon), but it is still a fabrication. This reflects a genuine tension in grounded generation: the model fills in plausible details that the formal model does not track, and those details are often reasonable even when technically incorrect. Tracking NPC equipment more explicitly would catch this class of violation.

The second pattern is narration-state lag on multi-step movement. When the player moves from the market back toward the tavern, the world model correctly stops them at the street, because the market's only exit leads west to the street and not directly to the tavern. The narration sometimes describes arriving at the tavern regardless. The world state is always correct; it is the narration that occasionally describes a destination one step further than the player actually reached.

The third pattern is judge miscalibration on movement turns. On turns where the player leaves a location, the NPC in that location naturally does not appear in narration about walking away. The judge flags these as memory utilisation failures because the NPC's memories are not reflected in the narration, but this is expected behaviour since an NPC's memories cannot be reflected through actions when the NPC is not present in the scene. Turns 11 and 15 in both runs follow this pattern and inflate the apparent memory failure count. Excluding those turns, the adjusted memory utilisation rate is 91.7% (run 1) and 90.9% (run 2).

Single rejection across both runs. The only rejected transition in both evaluations is "pick up dagger" on turn 8. This is correct behaviour: the dagger was moved to the player's inventory on turn 3 when the GM processed "examine the rusty dagger" and included a pickup in the state update. By turn 8 it is no longer in the location. The world model correctly rejects the attempt. The narration describing reaching for it anyway is the narration-state lag pattern described above.

Scene generation

Scene images are generated via the HuggingFace Inference API using FLUX.1-schnell (with the Pollinations API as a fallback), triggered only when the player moves to a new location, not on every turn. This keeps generation infrequent and the demo responsive. The image prompt is written by the GM as part of the same JSON response that updates world state, using the same structured output mechanism.

On the HuggingFace free inference tier, generation occasionally fails with rate limit or payment errors when monthly credits are exhausted. The game handles this gracefully: a failed generation returns None, and the last scene image persists rather than crashing the session. Latency on successful calls is typically 8 to 15 seconds on the free CPU tier.
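The failure handling described above could be sketched like this. The function and class names are hypothetical, and the providers are stand-ins that make no network calls; the real code in src/image/flux.py may be structured differently.

```python
def generate_scene(prompt, providers):
    """Try each provider in order; return image bytes, or None if all fail."""
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as exc:  # rate limits, exhausted credits, timeouts
            print(f"image provider failed: {exc}")
    return None


class SceneView:
    """Holds the last successfully generated scene image."""

    def __init__(self):
        self.current = None

    def on_move(self, prompt, providers):
        image = generate_scene(prompt, providers)
        if image is not None:
            self.current = image  # only replace the image on success
        return self.current


# Stand-in providers for illustration only.
def flaky_primary(prompt):
    raise RuntimeError("402 payment required")


def working_fallback(prompt):
    return f"<image for {prompt}>".encode()


def always_down(prompt):
    raise RuntimeError("rate limited")


view = SceneView()
first = view.on_move("a smoky tavern", [flaky_primary, working_fallback])
second = view.on_move("a wet street", [flaky_primary, always_down])
print(first == second)
```

On the second move both providers fail, so the view keeps showing the tavern image instead of raising.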

Stack

Python 3.11, Gradio 6, Gemini 2.5 Flash, FLUX.1-schnell via HuggingFace Inference API.

The LLM layer is abstracted behind a common interface. Swapping Gemini for Claude is a single config line change.
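One way such an abstraction can be arranged is shown below. The class names, the complete method, and the "llm_provider" config key are assumptions for illustration, and the implementations are placeholders that make no API calls.

```python
from abc import ABC, abstractmethod


class LLM(ABC):
    """Common interface the game master talks to (hypothetical shape)."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        ...


class GeminiLLM(LLM):
    def complete(self, prompt: str) -> str:
        return f"[gemini] {prompt}"  # placeholder instead of a real API call


class ClaudeLLM(LLM):
    def complete(self, prompt: str) -> str:
        return f"[claude] {prompt}"  # placeholder instead of a real API call


PROVIDERS = {"gemini": GeminiLLM, "claude": ClaudeLLM}


def build_llm(config):
    """Pick the implementation named in the config."""
    return PROVIDERS[config["llm_provider"]]()


llm = build_llm({"llm_provider": "claude"})  # the single line to change
print(llm.complete("Describe the tavern."))
```

The rest of the code depends only on the LLM interface, so nothing outside the config value needs to know which provider is in use.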

Running locally

git clone https://github.com/aeesh/quilltale
cd quilltale

python3.11 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

echo "GEMINI_API_KEY=your_key_here" > .env

gradio app.py

A free Gemini API key is available at aistudio.google.com. Image generation requires a HuggingFace token set as HF_TOKEN with inference provider permissions enabled.

Project layout

quilltale/
├── src/
│   ├── world/
│   │   └── state.py          WorldState, Location, NPC, Item, MemoryEntry
│   ├── agents/
│   │   └── game_master.py    GM agent — narrates and updates world state
│   ├── llm/
│   │   ├── base.py           Abstract LLM interface
│   │   ├── gemini.py         Gemini implementation
│   │   └── claude.py         Claude implementation
│   └── image/
│       └── flux.py           Scene image generation via HF Inference API
├── data/worlds/
│   └── default.json          The Ashen Reach world definition
├── assets/
│   └── styles.css            All styling
├── eval_results/             Evaluation reports
├── eval_runner.py            Automated evaluation script
├── tests/
├── app.py
└── requirements.txt

Extending the world

The world is defined entirely in data/worlds/default.json. Adding a location means adding an entry and updating the exits of whatever connects to it. The GM, memory system, image engine, and evaluation framework all pick it up without code changes.
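As an illustration of the two steps (add an entry, update the neighbour's exits), here is a small helper operating on a dictionary. The schema shown is hypothetical, since the actual structure of default.json is not reproduced here, and "docks" is an invented location used only for the example.

```python
import json


def add_location(world, loc_id, name, connections):
    """Add a location and wire exits in both directions.

    connections maps an exit direction on the new location to a pair
    (neighbour_id, direction_back_from_neighbour).
    """
    for neighbour, _ in connections.values():
        if neighbour not in world["locations"]:
            raise KeyError(f"unknown neighbour: {neighbour}")
    world["locations"][loc_id] = {
        "name": name,
        "exits": {d: n for d, (n, _) in connections.items()},
    }
    # Update the exits of whatever connects to the new location.
    for neighbour, back in connections.values():
        world["locations"][neighbour]["exits"][back] = loc_id
    return world


# Assumed schema: {"locations": {id: {"name": ..., "exits": {dir: id}}}}
world = {"locations": {"market": {"name": "Market", "exits": {"west": "street"}}}}
add_location(world, "docks", "The Docks", {"north": ("market", "south")})
print(json.dumps(world, indent=2))
```

Forgetting the return exit is the easy mistake when editing the file by hand: the new location would be reachable but the player could not walk back.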

What I would improve next

NPC equipment tracking. Marta's iron club appearing in narration without being formally tracked is the most consistent evaluation violation. Adding weapons to NPC inventory in the world schema would close this class of fabrication and enable combat mechanics.

Narration correction pass. Currently the GM proposes state changes and generates narration in a single call, so narration may describe outcomes that are later rejected by the world model. For deterministic actions this could be resolved by validating transitions before the GM call and injecting confirmed state into the prompt. For actions requiring GM judgment the dependency remains sequential, making a two-call architecture the complete solution at the cost of increased latency per turn.

Human validation of judge scores on a random subset to measure judge accuracy, particularly on the memory utilisation metric where the current calibration issue artificially deflates the reported rate.

Multi-session persistence via SQLite or a similar lightweight store, enabling the world state to survive tab closes and allowing longer narrative arcs.

Expanded world with more locations, NPCs, and items, plus a second mystery arc to demonstrate that the evaluation framework generalises beyond the default scenario.
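The multi-session persistence idea listed above could start as small as the following sketch, which stores the whole world state as one JSON document per session using Python's built-in sqlite3 module. The table layout, function names, and session id are assumptions, not an existing part of the project.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the store and make sure the sessions table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions "
        "(id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save_state(conn, session_id, state):
    """Insert or overwrite the serialised state for a session."""
    conn.execute(
        "INSERT OR REPLACE INTO sessions (id, state) VALUES (?, ?)",
        (session_id, json.dumps(state)),
    )
    conn.commit()


def load_state(conn, session_id):
    """Return the stored state, or None for an unknown session."""
    row = conn.execute(
        "SELECT state FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store()  # in-memory for the example; use a file path to persist
save_state(conn, "demo", {"player": "tavern", "turn": 3})
save_state(conn, "demo", {"player": "street", "turn": 4})
print(load_state(conn, "demo"))
```

Saving after every validated turn would let a reopened tab resume from the last confirmed world state.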
