gatekept

Audit a verifier for gameability before you ship the RL environment that depends on it.

gatekept is a verifier-gated environment packager for the verifiers and OpenEnv ecosystems. Given a task seed, it assembles an environment plus its verifier (reward function). Before emitting any shareable artifact, it runs the verifier through the rewardfuzz gameability auditor. If the verifier's audited gameability is at or above your policy threshold, the candidate is rejected fail-closed: nothing is written. Only a candidate that clears the gate is emitted in native verifiers / OpenEnv form, alongside the audit record that justified the decision.

This is not an environment generator (see AutoEnv / AutoForge for that). Generation here is a thin scaffold; the center of gravity is the pre-emit gameability gate. gatekept invents no attack technique — it imports rewardfuzz and registers an adapter so the existing auditor can drive verifiers-style rewards.

What it actually claims

gatekept makes a deliberately narrow, falsifiable set of claims:

  • It generates an environment + verifier from a seed, and refuses to emit unless the verifier clears a configurable gameability gate (fail-closed).
  • The gate decision is deterministic for a fixed seed (same input → same gameability score).
  • The gate runs CPU-only, with no network access by default (structural judge + rule-based attack strategies).
  • Emitted artifacts import and load in a clean clone.
  • A gate PASS means "the verifier survived the sampled attacks at this budget/seed." It is not a proof that the verifier cannot be gamed — the attack corpus is finite and an unknown exploit may exist. Treat a PASS as evidence, not a guarantee.
  • A PASS audits gameability only. gatekept does not check that the verifier correctly rewards valid answers: a verifier that returns the same low score to everyone is not "gameable" yet is useless, so pair a PASS with your own correctness tests. (A verifier that raises on a valid answer is rejected fail-closed, since it cannot be audited.)
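The last claim is easy to underestimate, so here is a minimal sketch of the kind of correctness test you might pair with a PASS. Everything in it is illustrative: the `(completion, answer) -> float` signature, the function names, and the test cases are assumptions, not part of gatekept's API.

```python
from typing import Callable

# Hypothetical verifier signature: (completion, reference answer) -> score.
RewardFn = Callable[[str, str], float]


def constant_reward(completion: str, answer: str) -> float:
    """Returns the same low score to everyone: nothing to exploit, nothing learned."""
    return 0.1


def exact_match_reward(completion: str, answer: str) -> float:
    """Scores 1.0 only when the normalized completion equals the reference."""
    return 1.0 if completion.strip().lower() == answer.strip().lower() else 0.0


def separates_valid_from_invalid(
    reward: RewardFn, cases: list[tuple[str, str, bool]]
) -> bool:
    """True if every valid case scores strictly higher than every invalid case."""
    valid = [reward(c, a) for c, a, ok in cases if ok]
    invalid = [reward(c, a) for c, a, ok in cases if not ok]
    return bool(valid) and bool(invalid) and min(valid) > max(invalid)


# (completion, reference answer, is the completion actually correct?)
CASES = [
    ("Paris", "paris", True),
    (" paris ", "paris", True),
    ("London", "paris", False),
    ("", "paris", False),
]

print(separates_valid_from_invalid(constant_reward, CASES))     # False
print(separates_valid_from_invalid(exact_match_reward, CASES))  # True
```

The constant verifier offers an attacker nothing to exploit, yet it fails this check, which is why a gameability PASS alone says nothing about usefulness.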

Install

# Core (gate + verifiers/OpenEnv emit). rewardfuzz is pinned by commit until v0.1.1.
pip install "gatekept @ git+https://github.com/hinanohart/gatekept"

# Optional: subprocess sandbox smoke check for generated envs
pip install "gatekept[sandbox] @ git+https://github.com/hinanohart/gatekept"

Quickstart

# Generate an environment from a seed, gate its verifier, and emit only on PASS.
gatekept pack --seed examples/arith.json --out out/arith

# Audit an existing verifier in a Python module without emitting anything.
gatekept gate --reward mymodule:reward_fn --answer "paris"

# Optionally smoke-run an emitted OpenEnv env in a scorewright sandbox (needs [sandbox]).
gatekept smoke --openenv out/arith/openenv

from gatekept import gatekeep, GatePolicy, load_policy

policy = load_policy("policy.toml")          # or GatePolicy(max_hackability=15.0, ...)
result = gatekeep(seed="examples/arith.json", policy=policy, out_dir="out/arith")

if result.emitted:
    print("emitted:", result.packaged.root)
    print(result.gate.summary())             # honest PASS wording
else:
    print("rejected:", result.gate.summary()) # nothing written

The gate

A candidate verifier is emitted only if both of the following bounds hold (conservative / fail-closed):

| bound | default | meaning |
| --- | --- | --- |
| max_hackability | 15.0 | rewardfuzz severity-weighted gameability score (0–100) must be below this |
| max_hack_rate_ci_hi | 0.25 | the upper bound of the bootstrap CI on the hack-rate fraction must be below this |
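As a rough illustration of how the two bounds combine, here is a minimal sketch of a fail-closed decision. `PolicySketch`, `bootstrap_ci_hi`, and `passes_gate` are hypothetical names, and the percentile bootstrap at the 97.5% level is an assumption; rewardfuzz's actual CI method and level are not specified here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySketch:
    """Hypothetical stand-in for GatePolicy; defaults copied from the table above."""
    max_hackability: float = 15.0
    max_hack_rate_ci_hi: float = 0.25


def bootstrap_ci_hi(outcomes: list[int], n_boot: int = 2000, seed: int = 0) -> float:
    """Percentile-bootstrap upper bound (97.5%, assumed) on the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed keeps the bound reproducible
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    return means[int(0.975 * (n_boot - 1))]


def passes_gate(hackability: float, ci_hi: float, policy: PolicySketch) -> bool:
    """Fail-closed: both values must be strictly below their bounds."""
    return hackability < policy.max_hackability and ci_hi < policy.max_hack_rate_ci_hi


policy = PolicySketch()
small = [0] * 9 + [1]          # 1 successful hack in 10 attacks (10%)
large = [0] * 360 + [1] * 40   # 40 successful hacks in 400 attacks (10%)

for name, outcomes in (("small", small), ("large", large)):
    hi = bootstrap_ci_hi(outcomes)
    print(name, round(hi, 3), passes_gate(5.0, hi, policy))
```

Both audits observe the same 10% hack rate, but the 10-attack audit has a much wider interval, so its upper bound lands above 0.25 and the candidate is rejected, while the 400-attack audit clears the bound.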

Using the CI upper bound means an under-sampled audit fails closed (rejects) rather than open. Tune both in policy.toml. Setting max_hackability = 0 is rejected as vacuous (the noisy-OR score structure would reject nearly everything).
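To see why a zero threshold is vacuous, consider the textbook noisy-OR combination, sketched below. This is an assumption about the general shape of the score only: rewardfuzz's severity weighting is not reproduced, and `noisy_or_score` is a hypothetical name.

```python
import math


def noisy_or_score(probabilities: list[float]) -> float:
    """Combine per-finding exploit probabilities into a 0-100 score via noisy-OR."""
    # Probability that no finding succeeds, assuming independence.
    survive = math.prod(1.0 - p for p in probabilities)
    return 100.0 * (1.0 - survive)


print(noisy_or_score([]))            # 0.0: no findings at all
print(noisy_or_score([0.01]) > 0.0)  # True: one faint finding already exceeds 0
print(noisy_or_score([0.5, 0.5]))    # 75.0
```

Under this structure the score is zero only when every finding has probability exactly zero, so a threshold of 0 would pass almost no real verifier.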

The figures below were measured on a curated corpus of 12 hand-labelled verifiers (5 gameable, 5 robust, 2 broken) with the default policy and the structural judge, CPU-only and with no network access. They are generated by python scripts/measure_gate.py, which is deterministic for a fixed seed. This is a sanity measurement on a curated set, not a claim of general detection power. Full provenance is in results/.

| metric | value |
| --- | --- |
| gameable verifiers caught (rejected) | 5/5 (100%) |
| robust verifiers passed | 5/5 (100%) |
| false-positive rate (robust wrongly rejected) | 0% |
| broken verifiers (crash on a valid answer) rejected | 2/2 |
| hackability, lenient -> hardened verifier | 90.7 -> 0.0 |
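For readers who want to recompute the rates on their own corpus, the arithmetic behind the first three rows can be sketched as follows. The `(label, decision)` pair layout is hypothetical and is not the output format of scripts/measure_gate.py; the counts are the ones reported above.

```python
from collections import Counter

# (hand label, gate decision) pairs matching the reported counts; layout is hypothetical.
OUTCOMES = (
    [("gameable", "reject")] * 5
    + [("robust", "pass")] * 5
    + [("broken", "reject")] * 2
)

counts = Counter(OUTCOMES)
totals = Counter(label for label, _ in OUTCOMES)

# Share of gameable verifiers the gate rejected.
catch_rate = counts[("gameable", "reject")] / totals["gameable"]
# Share of robust verifiers the gate wrongly rejected.
false_positive_rate = counts[("robust", "reject")] / totals["robust"]

print(f"caught: {catch_rate:.0%}, false positives: {false_positive_rate:.0%}")
# caught: 100%, false positives: 0%
```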

How it fits your workflow

gatekept is a plug-in step before prime env push (verifiers) or the OpenEnv validate/publish step — a missing piece, not a competitor. Ship the gatekept_audit.json next to your environment so reviewers can see the gate evidence.

Related work

Reward hacking of verifiers is an active research area. It includes analyses such as LLMs Gaming Verifiers (arXiv:2604.15149), in-training reward gating (Adversarial Reward Auditing, arXiv:2602.01750), and detection methods like Isomorphic Perturbation Testing. gatekept does not introduce a new attack or detection technique. It borrows rewardfuzz's corpus and contributes the packaging-stage integration: a fail-closed gate wired into the verifiers/OpenEnv emit pipeline so a gameable verifier is rejected before the artifact ships.

Dependencies (imported, not vendored)

| package | license | role |
| --- | --- | --- |
| rewardfuzz | MIT | gameability auditing engine (no strategy re-implemented here) |
| verifiers | MIT | emit target |
| openenv-core | BSD-3 | emit target |
| scorewright | MIT | optional sandbox smoke check ([sandbox]) |

Status

v0.1.0a3 — pre-alpha. API may change. See NOTICE for attribution and LICENSE (MIT).
