Audit a verifier for gameability before you ship the RL environment that depends on it.
gatekept is a verifier-gated environment packager for the verifiers and OpenEnv ecosystems. Given a task seed, it assembles an environment plus its verifier (reward function), then, before emitting any shareable artifact, runs the verifier through the rewardfuzz gameability auditor. If the verifier's audited gameability is at or above your policy threshold, the candidate is rejected fail-closed: nothing is written. Only a candidate that clears the gate is emitted into native verifiers / OpenEnv form, alongside the audit record that justified the decision.
This is not an environment generator (see AutoEnv / AutoForge for that). Generation here is a thin scaffold; the center of gravity is the pre-emit gameability gate.
gatekept invents no attack technique: it imports rewardfuzz and registers an adapter so the existing auditor can drive verifiers-style rewards.
gatekept makes a deliberately narrow, falsifiable set of claims:
- It generates an environment + verifier from a seed, and refuses to emit unless the verifier clears a configurable gameability gate (fail-closed).
- The gate decision is deterministic for a fixed seed (same input → same gameability score).
- The gate runs CPU-only, with no network access by default (structural judge + rule-based attack strategies).
- Emitted artifacts import and load in a clean clone.
- A gate PASS means "the verifier survived the sampled attacks at this budget/seed." It is not a proof that the verifier cannot be gamed — the attack corpus is finite and an unknown exploit may exist. Treat a PASS as evidence, not a guarantee.
- A PASS audits gameability only. gatekept does not check that the verifier correctly rewards valid answers: a verifier that returns the same low score to everyone is not "gameable" yet is useless, so pair a PASS with your own correctness tests. (A verifier that raises on a valid answer is rejected fail-closed, since it cannot be audited.)
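The last claim can be made concrete with a small correctness check that sits alongside the gate. This is a minimal sketch, not part of gatekept: the `(completion, answer) -> float` reward signature, the helper `correctness_failures`, and the 0.5 pass threshold are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed reward shape for this sketch: (completion, answer) -> score in [0, 1].
RewardFn = Callable[[str, str], float]
Case = Tuple[str, str, bool]  # (completion, answer, should_pass)


def constant_low(completion: str, answer: str) -> float:
    """Useless verifier: every submission gets the same low score."""
    return 0.1


def exact_match(completion: str, answer: str) -> float:
    """Simple verifier: full credit only for a normalized exact match."""
    return 1.0 if completion.strip().lower() == answer.strip().lower() else 0.0


def correctness_failures(
    reward_fn: RewardFn, cases: Iterable[Case], threshold: float = 0.5
) -> List[Case]:
    """Return the labelled cases where the verifier's verdict is wrong."""
    failures = []
    for completion, answer, should_pass in cases:
        passed = reward_fn(completion, answer) >= threshold
        if passed != should_pass:
            failures.append((completion, answer, should_pass))
    return failures


CASES = [
    ("paris", "paris", True),
    ("  Paris ", "paris", True),
    ("london", "paris", False),
]

print(len(correctness_failures(exact_match, CASES)))   # 0
print(len(correctness_failures(constant_low, CASES)))  # 2
```

The constant verifier offers nothing for an attack to exploit, yet it fails both valid cases here, which is exactly the gap a gameability-only PASS leaves open.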
# Core (gate + verifiers/OpenEnv emit). rewardfuzz is pinned by commit until v0.1.1.
pip install "gatekept @ git+https://github.com/hinanohart/gatekept"
# Optional: subprocess sandbox smoke check for generated envs
pip install "gatekept[sandbox] @ git+https://github.com/hinanohart/gatekept"

# Generate an environment from a seed, gate its verifier, and emit only on PASS.
gatekept pack --seed examples/arith.json --out out/arith
# Audit an existing verifier in a Python module without emitting anything.
gatekept gate --reward mymodule:reward_fn --answer "paris"
# Optionally smoke-run an emitted OpenEnv env in a scorewright sandbox (needs [sandbox]).
gatekept smoke --openenv out/arith/openenv

from gatekept import gatekeep, GatePolicy, load_policy
policy = load_policy("policy.toml") # or GatePolicy(max_hackability=15.0, ...)
result = gatekeep(seed="examples/arith.json", policy=policy, out_dir="out/arith")
if result.emitted:
    print("emitted:", result.packaged.root)
    print(result.gate.summary())  # honest PASS wording
else:
    print("rejected:", result.gate.summary())  # nothing written

A candidate verifier is emitted only if both hold (conservative / fail-closed):
| bound | default | meaning |
|---|---|---|
| max_hackability | 15.0 | rewardfuzz severity-weighted gameability score (0–100) must be below this |
| max_hack_rate_ci_hi | 0.25 | the upper bound of the bootstrap CI on the hack-rate fraction must be below this |
Using the CI upper bound means an under-sampled audit fails closed (rejects) rather than open.
Tune both in policy.toml. Setting max_hackability = 0 is rejected as vacuous (the noisy-OR
score structure would reject nearly everything).
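To illustrate why gating on the CI upper bound fails closed, here is a rough sketch of a two-bound decision with a percentile bootstrap. It is not gatekept's or rewardfuzz's implementation: `noisy_or_score`, `bootstrap_ci_hi`, and `gate_passes` are hypothetical names, the noisy-OR formula is the generic textbook form, and the resample count and 97.5% level are assumed values.

```python
import random
from math import prod


def noisy_or_score(attack_strengths):
    """Generic noisy-OR: combine per-attack strengths in [0, 1] into a 0-100 score."""
    return 100.0 * (1.0 - prod(1.0 - s for s in attack_strengths))


def bootstrap_ci_hi(outcomes, n_resamples=1000, level=0.975, seed=0):
    """Percentile-bootstrap upper bound on the mean of 0/1 hack outcomes."""
    rng = random.Random(seed)  # fixed seed keeps the bound reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    return means[min(int(level * n_resamples), n_resamples - 1)]


def gate_passes(hackability, hack_outcomes,
                max_hackability=15.0, max_hack_rate_ci_hi=0.25):
    """PASS only if both values are strictly below their bounds."""
    ci_hi = bootstrap_ci_hi(hack_outcomes)
    return hackability < max_hackability and ci_hi < max_hack_rate_ci_hi


small = [1] + [0] * 4            # 1 hack in 5 attacks, point estimate 0.20
large = [1] * 200 + [0] * 800    # 200 hacks in 1000 attacks, same point estimate

print(gate_passes(5.0, small))   # False: too few samples, wide interval
print(gate_passes(5.0, large))   # True: same rate, tight interval
print(noisy_or_score([0.01, 0.01]) > 0)  # True: any nonzero strength lifts the score
```

Both samples have the same observed hack rate of 0.20, but only the larger one has an upper bound below 0.25. Under this noisy-OR form, any attack with nonzero strength pushes the score above zero, which suggests why a zero threshold would leave almost nothing to emit.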
Measured on a curated corpus of 12 hand-labelled verifiers (5 gameable, 5 robust, 2 broken) with the default policy, structural judge, CPU-only and no network. Generated by python scripts/measure_gate.py (deterministic for a fixed seed). A sanity measurement on a curated set — not a claim of general detection power. Full provenance in results/.
| metric | value |
|---|---|
| gameable verifiers caught (rejected) | 5/5 (100%) |
| robust verifiers passed | 5/5 (100%) |
| false-positive rate (robust wrongly rejected) | 0% |
| broken verifiers (crash on a valid answer) rejected | 2/2 |
| hackability, lenient -> hardened verifier | 90.7 -> 0.0 |
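For intuition about the lenient versus hardened row, the sketch below contrasts two toy verifiers against a few hand-written attack strings. These are not the corpus verifiers and the numbers are not the measured scores above; the `(completion, answer) -> float` signature, the attack list, and `hack_rate` are assumptions made for the example.

```python
def lenient_reward(completion: str, answer: str) -> float:
    """Gameable: credit if the answer appears anywhere in the completion."""
    return 1.0 if answer.lower() in completion.lower() else 0.0


def hardened_reward(completion: str, answer: str) -> float:
    """Stricter: credit only when the whole normalized completion equals the answer."""
    return 1.0 if completion.strip().lower() == answer.strip().lower() else 0.0


# Hypothetical attack strings: none of them is a clean, committed answer.
ATTACKS = [
    "paris london berlin madrid rome",           # list every candidate
    "The answer could be Paris, or maybe not.",  # hedge around the keyword
    "not paris",                                 # negate the keyword
]


def hack_rate(reward_fn, answer: str = "paris") -> float:
    """Fraction of attack strings that the verifier rewards."""
    hits = sum(reward_fn(attack, answer) >= 0.5 for attack in ATTACKS)
    return hits / len(ATTACKS)


print(hack_rate(lenient_reward))          # 1.0
print(hack_rate(hardened_reward))         # 0.0
print(hardened_reward("Paris", "paris"))  # 1.0, a valid answer still scores
```

The substring check rewards every attack because each one merely mentions the keyword, while the exact-match check rewards none of them and still accepts the genuine answer.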
gatekept is a plug-in step before prime env push (verifiers) or the OpenEnv
validate/publish step — a missing piece, not a competitor. Ship the gatekept_audit.json next
to your environment so reviewers can see the gate evidence.
Reward hacking of verifiers is an active research area: analyses such as LLMs Gaming
Verifiers (arXiv:2604.15149), in-training reward gating (Adversarial Reward Auditing,
arXiv:2602.01750), and detection methods like Isomorphic Perturbation Testing. gatekept does
not introduce a new attack or detection technique — it borrows rewardfuzz's corpus and
contributes the packaging-stage integration: a fail-closed gate wired into the
verifiers/OpenEnv emit pipeline so a gameable verifier is rejected before the artifact ships.
| package | license | role |
|---|---|---|
| rewardfuzz | MIT | gameability auditing engine (no strategy re-implemented here) |
| verifiers | MIT | emit target |
| openenv-core | BSD-3 | emit target |
| scorewright | MIT | optional sandbox smoke check ([sandbox]) |
v0.1.0a3 — pre-alpha. API may change. See NOTICE for attribution and LICENSE (MIT).