[RFC 007] add Environment dataset (taskset) RFC#727
Conversation
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review — PR #727
Tier 1: Spec-Level Issues
-
reset() signature mismatch with cursor: The RFC proposes
env.reset()binds row 0, but the Gymnasium API signature isreset(seed?, episode_id?). The RFC must show howepisode_idmaps to dataset row selection — a caller passingepisode_id=42would expect row 42, not row 0. -
Reward computation location unclear: The RFC introduces dataset rows with
openenv_reset/openenv_step/taskcolumn conventions but doesn't specify whether reward computation (which must live inside the environment per RFC 002/004) reads from the dataset row. If thetaskcolumn carries ground truth for reward logic, the RFC needs to state that the reward rubric lives server-side and the dataset row is input-only. -
"Exactly one of space_id, image, or package" not validated: The RFC declares this constraint but provides no schema validation approach. It should reference the existing
openenv.yamlschema validation pathway or specify how AutoEnv enforces mutual exclusivity at parse time. -
URI scheme ambiguity: The
hf://datasets/{repo_id}/{environment_id}@{revision}format doesn't clarify how{environment_id}resolves — is it a filename or a key within the YAMLenvironmentslist? Needs to be unambiguous for multiple environments in one repo. -
Phase 5 ("docs") has no testable deliverables: Phases 1-4 are concrete. Phase 5 should be fleshed out or merged into Phase 4.
Tier 2: Alignment Flags
ALIGNMENT FLAG: Dataset cursor iteration may expose reset control to agents
- Invariant: "Agents cannot reset" (RFC 001)
- Concern: If an agent can influence which row
step()reads next, or if the cursor wraps/is queryable, an agent could learn the dataset has a "restart" boundary. The RFC doesn't specify whether cursor position is hidden from agents.
ALIGNMENT FLAG: Verifiers framework declaration enables external reward computation
- Invariant: "Rewards inside environment" (RFC 002)
- Concern: Verifiers computes reward/verification outside the environment boundary. If a dataset-bound environment declares a Verifiers runtime, reward computation would violate the invariant. The RFC needs to either exclude reward from the Verifiers path or define how external scores are re-ingested.
ALIGNMENT FLAG: AutoEnv resolution may break client-server separation
- Invariant: Client-server separation
- Concern: If AutoEnv downloads and instantiates the
packageentry on the client side before a container boundary exists, it may import server-side code from the client context.
ALIGNMENT FLAG: Cursor model and "one env = one trajectory"
- Invariant: One env = one trajectory (RFC 004)
- Concern: The RFC should explicitly state that each
reset()starts a new trajectory bound to a new row, and the cursor doesn't create mid-episode task switching.
Verdict
5 spec issues + 4 alignment flags. The overall direction is sound but needs these clarifications before implementation begins.
Automated review by Claude Code | Learn more
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Alignment Review Report
Automated Checks
- Lint: PASS — No source code modified (RFC-only).
- Debug code: CLEAN.
Tier 1: Fixes Required
-
RFC numbering conflict (blocking) —
rfcs/006-hf-rl-environment-datasets.mdclaims RFC ID 006, but PR #624 and PR #731 also claim RFC 006 for different proposals. This must be renumbered (suggest 007). -
Authors field — Header reads
**Authors**: @ben. Should be@burtenshaw(the actual GitHub handle). -
README section — The
rfcs/README.mdupdate adds the link without a section heading. Existing entries use named categories. Add a section like "Dataset & Task Management".
Tier 2: Alignment Discussion
ALIGNMENT FLAG: Row cursor advances on every step(), conflating task progression with simulation steps
- Principle at stake: Gymnasium API invariant (
step(action) -> Observation) and "Agents cannot reset" - The concern: The RFC proposes
step()automatically advancesdataset_index(row 0 → row 1 → row 2). This means task selection — a simulation-control concern owned by orchestration — happens silently inside the agent's step path. In the standard Gymnasium model,reset()binds a task,step()executes actions within it. Binding a new task row per step breaks episode boundaries. If this is intentional for step-indexed datasets (e.g., TBench-2), it should be explicitly scoped as a special case, not the default. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: env.dataset exposes raw dataset (including ground-truth) as a public attribute
- Principle at stake: Agent isolation, rewards inside environment
- The concern:
env.dataset,env.current_roware first-class public attributes. In harness-style integrations where the agent has access to the env client object, it can read ground-truth answers directly fromenv.dataset. The RFC acknowledges this risk but places the burden on environment authors. A safer design would make these private or use aTaskProviderprotocol that never surfaces raw rows to callers outside the server. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: openenv_reset/openenv_step kwargs extend the Gymnasium API signature
- Principle at stake: Gymnasium API invariant (
reset(seed?, episode_id?) -> Observation) - The concern: The implementation calls
env.reset(split="train", index=0)— adding parameters toreset(). This violates the API signature invariant unless scoped to a subclassDatasetBoundEnvironmentthat does not change the base class signature. The RFC should explicitly address this. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: trust_remote_code=False default but package runtime target bypasses it
- Principle at stake: Container isolation, no credential exposure
- The concern: For
frameworks.openenv.package, "it installs or imports through the existing auto-discovery path." Installing an arbitrary package executessetup.py/build hooks, bypassingtrust_remote_code=False. The RFC should specify whetherpackagetargets require explicit opt-in or must be installed inside the Docker container boundary. - Suggested reviewer: @Darktex
ALIGNMENT FLAG: Multi-framework environment.yaml scope
- Principle at stake: "One canonical way to build environments"
- The concern: The RFC designs
environment.yamlto serve Harbor and Verifiers alongside OpenEnv. This positions OpenEnv as steward of a multi-framework registry format — a significant scope expansion. Whether OpenEnv should own a multi-framework manifest or just anopenenv:block in a third-party-owned schema warrants explicit alignment. - Suggested reviewer: @Darktex
Summary
- 3 mechanical issues to fix (RFC number conflict is blocking)
- 5 alignment points for human review
- The core idea —
environment.yamlat dataset repo root consumed byAutoEnv— is sound and aligns with "minimize lifecycle deltas." The flags identify design decisions needing explicit resolution before implementation.
Automated review by Claude Code | Learn more
Darktex
left a comment
There was a problem hiding this comment.
Alignment Review — RFC 006: HF RL Environment Datasets
Verdict: Request Changes
The RFC is well-motivated and the design is coherent. Two categories of issues need addressing before this can move to Accepted.
Tier 1 — Mechanical Fixes
[T1-1] PR title says "RFC 007" but file and document use RFC 006 (CRITICAL)
The file is correctly named 006-hf-rl-environment-datasets.md and the document header reads # RFC 006. The PR title must be corrected to "RFC 006" to avoid confusing the index.
[T1-2] spec_version inconsistency in examples
In the "Optional OpenEnv Runtime Default" example near the end of the Examples section, openenv.yaml uses spec_version: 1 (bare integer), while every environment.yaml example uses spec_version: hf-rl-env-0.1 (string). Either use consistent values or add a comment explaining that openenv.yaml uses a different version scheme.
[T1-3] Template compliance: Key Design Decisions missing Trade-offs subsections
rfcs/README.md requires each design decision to have Chosen Approach / Rationale / Trade-offs sub-structure. The current "Key Design Decisions" section uses a flat bullet list. Please restructure or note that this RFC uses a condensed form intentionally.
[T1-4] Non-Goal 6 conflicts with the Verifiers adapter field in examples
Non-Goal #6: "No required remote code execution to load a task set." The Terminal-Bench example includes adapter: verifiers.v1.packages.tasksets.HarborTaskset, which names an importable class. OpenEnv ignores this field, but the example implies class-loading for the Verifiers framework. Add a clarifying comment noting that other frameworks' adapter fields are outside OpenEnv's execution scope.
Tier 2 — Alignment Flags
ALIGNMENT FLAG 1: Default row-cursor behavior and "agents cannot reset"
- Principle at stake: Agents cannot reset (RFC 001, INVARIANTS.md)
- The concern: The default behavior advances the dataset row cursor on every
step()call. In environments where an episode spans many steps over a single task, this model presents a new task to the agent on each action. The RFC should clarify whether the cursor-advance-on-step default is intended for single-step-per-task environments only, and how multi-step episodes over one task are handled. More importantly, sincestep()is called by the training orchestrator on behalf of the agent, the RFC should explicitly state that cursor advancement is controlled by the orchestration layer and cannot be triggered by the agent directly. - Suggested reviewer: @Darktex
ALIGNMENT FLAG 2: Multi-episode cursor traversal vs. "one env = one trajectory"
- Principle at stake: One env = one trajectory (RFC 004, PRINCIPLES.md)
- The concern: The RFC does not state whether
reset()resets the row cursor to 0. If a single dataset-bound environment instance is reused across multiple episodes (cursor advancing through the dataset), it produces many trajectories from one instance — in tension with the one-env-one-trajectory invariant. Please explicitly state thereset()cursor behavior and, if multi-episode reuse is intended, address how this squares with RFC 004. - Suggested reviewer: @Darktex
ALIGNMENT FLAG 3: Answer/verifier metadata leakage and "rewards inside environment"
- Principle at stake: Rewards inside environment (RFC 002, INVARIANTS.md)
- The concern: The canonical row schema includes
"answer": "42"in the task dict. The RFC states the runtime must not leak this to the agent, but the enforcement mechanism is unspecified. If the task dict is passed wholesale to the runtime and forwarded to observations, ground truth is agent-visible. The RFC should specify which layer is responsible for answer stripping and whether the framework enforces this or leaves it to environment authors. - Suggested reviewer: @Darktex
ALIGNMENT FLAG 4: Unresolved secrets handling
- Principle at stake: No credential exposure (INVARIANTS.md)
- The concern: The
secretsfield inenvironment.yamlis left to Open Question #5, which is unresolved. The implementation plan (Phases 1–2) should specify how the resolver reads secrets (env vars only?), validates their presence before runtime launch, and ensures they are never included in error messages or logs. This cannot be deferred past implementation. - Suggested reviewer: @Darktex
ALIGNMENT FLAG 5: trust_remote_code=False gives false safety with package-backed runtimes
- Principle at stake: Container isolation; Non-Goal 6 (no required remote code execution)
- The concern:
trust_remote_code=Falseis shown as a safe default, but forpackage-backed runtimes, Phase 2 says AutoEnv "installs or imports through the existing auto-discovery path" — which is remote code execution. The RFC must specify whattrust_remote_code=Falsedoes for thepackagebackend (block install? raise? warn?), so users are not given a false sense of security. - Suggested reviewer: @Darktex
Non-blocking Suggestions
- Add a direct Markdown link to the RFC 001 section that covers "tasks as reset-time inputs" (currently a bare prose reference).
- Consider adding a "Backward Compatibility" section documenting how existing
openenv.yamlmanifests withouttask_setscontinue to work. - Open Question #1 (
openenv-core[tasksets]extra) should be resolved before Phase 1 implementation to avoid an API decision made implicitly in code.
Note: This is an automated review by Claude Code, not a human review.
Automated review by Claude Code | Learn more
This PR adds RFC 006 for Hugging Face RL environment datasets, documenting dataset-root environment.yaml declarations, AutoEnv handling for hf://datasets references, and dataset-bound environment behavior. It also updates the RFC index so the proposal is discoverable.\n\nValidation: git diff --check and git diff --cached --check.