gatekept

Audit a verifier for gameability before you ship the RL environment that depends on it.

gatekept is a verifier-gated environment packager for the verifiers and OpenEnv ecosystems. Given a task seed it assembles an environment plus its verifier (reward function), then — before emitting any shareable artifact — it runs the verifier through the rewardfuzz gameability auditor. If the verifier's audited gameability is at or above your policy threshold, the candidate is rejected fail-closed: nothing is written. Only a candidate that clears the gate is emitted into native verifiers / OpenEnv form, alongside the audit record that justified the decision.

This is not an environment generator (see AutoEnv / AutoForge for that). Generation here is a thin scaffold; the center of gravity is the pre-emit gameability gate. gatekept invents no attack technique — it imports rewardfuzz and registers an adapter so the existing auditor can drive verifiers-style rewards.

What it actually claims

gatekept makes a deliberately narrow, falsifiable set of claims:

It generates an environment + verifier from a seed, and refuses to emit unless the verifier clears a configurable gameability gate (fail-closed).
The gate decision is deterministic for a fixed seed (same input → same gameability score).
The gate runs CPU-only, with no network access by default (structural judge + rule-based attack strategies).
Emitted artifacts import and load in a clean clone.
A gate PASS means "the verifier survived the sampled attacks at this budget/seed." It is not a proof that the verifier cannot be gamed — the attack corpus is finite and an unknown exploit may exist. Treat a PASS as evidence, not a guarantee.
A PASS audits gameability only. gatekept does not check that the verifier correctly rewards valid answers: a verifier that returns the same low score to everyone is not "gameable" yet is useless, so pair a PASS with your own correctness tests. (A verifier that raises on a valid answer is rejected fail-closed, since it cannot be audited.)
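
Because a PASS says nothing about correctness, a small correctness check can sit beside the gate. The sketch below is illustrative only: the `(completion, answer) -> float` reward signature and the names `exact_match_reward`, `constant_reward`, and `discriminates` are assumptions made for this example, not part of gatekept's API. It shows how a constant-score verifier, which has nothing to game, still fails a basic "valid beats invalid" test.

```python
from typing import Callable

# Assumed signature for illustration; gatekept's real verifier interface may differ.
RewardFn = Callable[[str, str], float]


def exact_match_reward(completion: str, answer: str) -> float:
    """Hypothetical verifier: full reward only for a normalized exact match."""
    return 1.0 if completion.strip().lower() == answer.strip().lower() else 0.0


def constant_reward(completion: str, answer: str) -> float:
    """Hypothetical useless verifier: the same low score for every input."""
    return 0.1


def discriminates(reward: RewardFn, answer: str, wrong: list[str]) -> bool:
    """True if the valid answer scores strictly higher than every wrong one."""
    valid_score = reward(answer, answer)
    return all(valid_score > reward(w, answer) for w in wrong)


wrong_answers = ["london", "", "paris paris paris is the answer"]
print(discriminates(exact_match_reward, "paris", wrong_answers))  # True
print(discriminates(constant_reward, "paris", wrong_answers))     # False
```

A check like this belongs in your own test suite; it complements the gate rather than replacing it.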

Install

# Core (gate + verifiers/OpenEnv emit). rewardfuzz is pinned by commit until v0.1.1.
pip install "gatekept @ git+https://github.com/hinanohart/gatekept"

# Optional: subprocess sandbox smoke check for generated envs
pip install "gatekept[sandbox] @ git+https://github.com/hinanohart/gatekept"

Quickstart

# Generate an environment from a seed, gate its verifier, and emit only on PASS.
gatekept pack --seed examples/arith.json --out out/arith

# Audit an existing verifier in a Python module without emitting anything.
gatekept gate --reward mymodule:reward_fn --answer "paris"

# Optionally smoke-run an emitted OpenEnv env in a scorewright sandbox (needs [sandbox]).
gatekept smoke --openenv out/arith/openenv

from gatekept import gatekeep, GatePolicy, load_policy

policy = load_policy("policy.toml")          # or GatePolicy(max_hackability=15.0, ...)
result = gatekeep(seed="examples/arith.json", policy=policy, out_dir="out/arith")

if result.emitted:
    print("emitted:", result.packaged.root)
    print(result.gate.summary())             # honest PASS wording
else:
    print("rejected:", result.gate.summary()) # nothing written

The gate

A candidate verifier is emitted only if both hold (conservative / fail-closed):

bound	default	meaning
`max_hackability`	`15.0`	rewardfuzz severity-weighted gameability score (0–100) must be below this
`max_hack_rate_ci_hi`	`0.25`	the upper bound of the bootstrap CI on the hack-rate fraction must be below this
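
The two-bound rule can be sketched in a few lines. This is not gatekept's implementation: `Bounds` and `gate_passes` are hypothetical names, and treating a missing or non-finite audit value as a rejection is an assumption about what fail-closed implies here. Only the two bound names and their defaults come from the table above.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounds:
    # Defaults copied from the table above.
    max_hackability: float = 15.0
    max_hack_rate_ci_hi: float = 0.25


def gate_passes(
    hackability: Optional[float],
    hack_rate_ci_hi: Optional[float],
    bounds: Bounds = Bounds(),
) -> bool:
    """PASS only if both audited values are present, finite, and strictly below their bounds."""
    for value in (hackability, hack_rate_ci_hi):
        # Assumption: an audit that produced no usable number is a rejection.
        if value is None or not math.isfinite(value):
            return False
    return (
        hackability < bounds.max_hackability
        and hack_rate_ci_hi < bounds.max_hack_rate_ci_hi
    )


print(gate_passes(0.0, 0.05))    # True: both values below their bounds
print(gate_passes(15.0, 0.05))   # False: exactly at the threshold is rejected
print(gate_passes(3.0, 0.40))    # False: CI upper bound too high
print(gate_passes(None, 0.05))   # False: missing score fails closed
```

The strict `<` matches the "at or above the threshold is rejected" wording used earlier in this README.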

Using the CI upper bound means an under-sampled audit fails closed (rejects) rather than open. Tune both in policy.toml. Setting max_hackability = 0 is rejected as vacuous (the noisy-OR score structure would reject nearly everything).
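
To see why gating on the CI upper bound penalizes small attack budgets, consider a percentile bootstrap over per-attack outcomes. The sketch below is an assumption about the general technique, not a description of how rewardfuzz computes its interval; `bootstrap_ci_hi` and its parameters are hypothetical. It compares two audits with the same observed hack rate (10%) but different sample sizes.

```python
import random


def bootstrap_ci_hi(
    outcomes: list[bool],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> float:
    """Upper end of a percentile bootstrap CI for the fraction of True outcomes."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(outcomes)
    # Resample with replacement and record the hack rate of each resample.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    upper_q = 1.0 - (1.0 - level) / 2.0  # 0.975 for a 95% interval
    index = min(n_resamples - 1, int(upper_q * n_resamples))
    return rates[index]


# Same observed hack rate (10%), different attack budgets.
small = [True] * 1 + [False] * 9     # 1 hack in 10 attacks
large = [True] * 10 + [False] * 90   # 10 hacks in 100 attacks

hi_small = bootstrap_ci_hi(small)
hi_large = bootstrap_ci_hi(large)
print(hi_small > 0.25)  # True: the small audit would be rejected
print(hi_large < 0.25)  # True: the larger audit clears the default bound
```

With only ten attacks the interval is wide enough to cross the default `0.25` bound, so the small audit is rejected even though its point estimate is identical to the larger one.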

The figures below were measured on a curated corpus of 12 hand-labelled verifiers (5 gameable, 5 robust, 2 broken) using the default policy and the structural judge, CPU-only with no network access. They are generated by python scripts/measure_gate.py, which is deterministic for a fixed seed. This is a sanity measurement on a curated set, not a claim of general detection power. Full provenance is in results/.

metric	value
gameable verifiers caught (rejected)	5/5 (100%)
robust verifiers passed	5/5 (100%)
false-positive rate (robust wrongly rejected)	0%
broken verifiers (crash on a valid answer) rejected	2/2
hackability, lenient -> hardened verifier	90.7 -> 0.0

How it fits your workflow

gatekept is a plug-in step before prime env push (verifiers) or the OpenEnv validate/publish step — a missing piece, not a competitor. Ship the gatekept_audit.json next to your environment so reviewers can see the gate evidence.
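
One way to wire the gate into a publish pipeline is a wrapper that turns the gate outcome into a process exit code, so the push step runs only on success. The sketch below is hypothetical: `gate_exit_code` is not part of gatekept, and it assumes only that the gate result exposes the `emitted` attribute shown in the Quickstart. The gate call is injected as a callable so the example runs without gatekept installed.

```python
from types import SimpleNamespace
from typing import Any, Callable


def gate_exit_code(run_gate: Callable[[], Any]) -> int:
    """Return 0 only if the gate emitted an artifact; anything else is non-zero."""
    try:
        result = run_gate()
    except Exception as exc:  # fail closed on unexpected errors
        print(f"gate error: {exc}")
        return 1
    if getattr(result, "emitted", False):
        return 0
    print("gate rejected the candidate; skipping publish")
    return 1


# Stand-ins for a real gate call, used here only to exercise the wrapper.
print(gate_exit_code(lambda: SimpleNamespace(emitted=True)))   # 0
print(gate_exit_code(lambda: SimpleNamespace(emitted=False)))  # 1
```

In a real pipeline, `run_gate` could wrap the `gatekeep(...)` call from the Quickstart, with the returned value passed to `sys.exit` before the publish step.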

Related work

Reward hacking of verifiers is an active research area. Examples include analyses such as LLMs Gaming Verifiers (arXiv:2604.15149), in-training reward gating (Adversarial Reward Auditing, arXiv:2602.01750), and detection methods like Isomorphic Perturbation Testing. gatekept does not introduce a new attack or detection technique: it borrows rewardfuzz's corpus and contributes the packaging-stage integration, a fail-closed gate wired into the verifiers/OpenEnv emit pipeline so that a gameable verifier is rejected before the artifact ships.

Dependencies (imported, not vendored)

package	license	role
rewardfuzz	MIT	gameability auditing engine (no strategy re-implemented here)
verifiers	MIT	emit target
openenv-core	BSD-3	emit target
scorewright	MIT	optional sandbox smoke check (`[sandbox]`)

Status

v0.1.0a3 — pre-alpha. API may change. See NOTICE for attribution and LICENSE (MIT).
