[RFC 007] add Environment dataset (taskset) RFC by burtenshaw · Pull Request #727 · huggingface/OpenEnv

burtenshaw · 2026-05-21T13:28:24Z

This PR adds RFC 006 for Hugging Face RL environment datasets, documenting dataset-root environment.yaml declarations, AutoEnv handling for hf://datasets references, and dataset-bound environment behavior. It also updates the RFC index so the proposal is discoverable.\n\nValidation: git diff --check and git diff --cached --check.

Darktex

Note: This is an automated review by Claude Code, not a human review.

Alignment Review — PR #727

Tier 1: Spec-Level Issues

reset() signature mismatch with cursor: The RFC proposes env.reset() binds row 0, but the Gymnasium API signature is reset(seed?, episode_id?). The RFC must show how episode_id maps to dataset row selection — a caller passing episode_id=42 would expect row 42, not row 0.
Reward computation location unclear: The RFC introduces dataset rows with openenv_reset/openenv_step/task column conventions but doesn't specify whether reward computation (which must live inside the environment per RFC 002/004) reads from the dataset row. If the task column carries ground truth for reward logic, the RFC needs to state that the reward rubric lives server-side and the dataset row is input-only.
"Exactly one of space_id, image, or package" not validated: The RFC declares this constraint but provides no schema validation approach. It should reference the existing openenv.yaml schema validation pathway or specify how AutoEnv enforces mutual exclusivity at parse time.
URI scheme ambiguity: The hf://datasets/{repo_id}/{environment_id}@{revision} format doesn't clarify how {environment_id} resolves — is it a filename or a key within the YAML environments list? Needs to be unambiguous for multiple environments in one repo.
Phase 5 ("docs") has no testable deliverables: Phases 1-4 are concrete. Phase 5 should be fleshed out or merged into Phase 4.

Tier 2: Alignment Flags

ALIGNMENT FLAG: Dataset cursor iteration may expose reset control to agents

Invariant: "Agents cannot reset" (RFC 001)
Concern: If an agent can influence which row step() reads next, or if the cursor wraps/is queryable, an agent could learn the dataset has a "restart" boundary. The RFC doesn't specify whether cursor position is hidden from agents.

ALIGNMENT FLAG: Verifiers framework declaration enables external reward computation

Invariant: "Rewards inside environment" (RFC 002)
Concern: Verifiers computes reward/verification outside the environment boundary. If a dataset-bound environment declares a Verifiers runtime, reward computation would violate the invariant. The RFC needs to either exclude reward from the Verifiers path or define how external scores are re-ingested.

ALIGNMENT FLAG: AutoEnv resolution may break client-server separation

Invariant: Client-server separation
Concern: If AutoEnv downloads and instantiates the package entry on the client side before a container boundary exists, it may import server-side code from the client context.

ALIGNMENT FLAG: Cursor model and "one env = one trajectory"

Invariant: One env = one trajectory (RFC 004)
Concern: The RFC should explicitly state that each reset() starts a new trajectory bound to a new row, and the cursor doesn't create mid-episode task switching.

Verdict

5 spec issues + 4 alignment flags. The overall direction is sound but needs these clarifications before implementation begins.

Automated review by Claude Code | Learn more

Darktex

Note: This is an automated review by Claude Code, not a human review.

Alignment Review Report

Automated Checks

Lint: PASS — No source code modified (RFC-only).
Debug code: CLEAN.

Tier 1: Fixes Required

RFC numbering conflict (blocking) — rfcs/006-hf-rl-environment-datasets.md claims RFC ID 006, but PR #624 and PR #731 also claim RFC 006 for different proposals. This must be renumbered (suggest 007).
Authors field — Header reads **Authors**: @ben. Should be @burtenshaw (the actual GitHub handle).
README section — The rfcs/README.md update adds the link without a section heading. Existing entries use named categories. Add a section like "Dataset & Task Management".

Tier 2: Alignment Discussion

ALIGNMENT FLAG: Row cursor advances on every step(), conflating task progression with simulation steps

Principle at stake: Gymnasium API invariant (step(action) -> Observation) and "Agents cannot reset"
The concern: The RFC proposes step() automatically advances dataset_index (row 0 → row 1 → row 2). This means task selection — a simulation-control concern owned by orchestration — happens silently inside the agent's step path. In the standard Gymnasium model, reset() binds a task, step() executes actions within it. Binding a new task row per step breaks episode boundaries. If this is intentional for step-indexed datasets (e.g., TBench-2), it should be explicitly scoped as a special case, not the default.
Suggested reviewer: @Darktex

ALIGNMENT FLAG: env.dataset exposes raw dataset (including ground-truth) as a public attribute

Principle at stake: Agent isolation, rewards inside environment
The concern: env.dataset, env.current_row are first-class public attributes. In harness-style integrations where the agent has access to the env client object, it can read ground-truth answers directly from env.dataset. The RFC acknowledges this risk but places the burden on environment authors. A safer design would make these private or use a TaskProvider protocol that never surfaces raw rows to callers outside the server.
Suggested reviewer: @Darktex

ALIGNMENT FLAG: openenv_reset/openenv_step kwargs extend the Gymnasium API signature

Principle at stake: Gymnasium API invariant (reset(seed?, episode_id?) -> Observation)
The concern: The implementation calls env.reset(split="train", index=0) — adding parameters to reset(). This violates the API signature invariant unless scoped to a subclass DatasetBoundEnvironment that does not change the base class signature. The RFC should explicitly address this.
Suggested reviewer: @Darktex

ALIGNMENT FLAG: trust_remote_code=False default but package runtime target bypasses it

Principle at stake: Container isolation, no credential exposure
The concern: For frameworks.openenv.package, "it installs or imports through the existing auto-discovery path." Installing an arbitrary package executes setup.py/build hooks, bypassing trust_remote_code=False. The RFC should specify whether package targets require explicit opt-in or must be installed inside the Docker container boundary.
Suggested reviewer: @Darktex

ALIGNMENT FLAG: Multi-framework environment.yaml scope

Principle at stake: "One canonical way to build environments"
The concern: The RFC designs environment.yaml to serve Harbor and Verifiers alongside OpenEnv. This positions OpenEnv as steward of a multi-framework registry format — a significant scope expansion. Whether OpenEnv should own a multi-framework manifest or just an openenv: block in a third-party-owned schema warrants explicit alignment.
Suggested reviewer: @Darktex

Summary

3 mechanical issues to fix (RFC number conflict is blocking)
5 alignment points for human review
The core idea — environment.yaml at dataset repo root consumed by AutoEnv — is sound and aligns with "minimize lifecycle deltas." The flags identify design decisions needing explicit resolution before implementation.

Automated review by Claude Code | Learn more

Darktex

Alignment Review — RFC 006: HF RL Environment Datasets

Verdict: Request Changes

The RFC is well-motivated and the design is coherent. Two categories of issues need addressing before this can move to Accepted.

Tier 1 — Mechanical Fixes

[T1-1] PR title says "RFC 007" but file and document use RFC 006 (CRITICAL)

The file is correctly named 006-hf-rl-environment-datasets.md and the document header reads # RFC 006. The PR title must be corrected to "RFC 006" to avoid confusing the index.

[T1-2] spec_version inconsistency in examples

In the "Optional OpenEnv Runtime Default" example near the end of the Examples section, openenv.yaml uses spec_version: 1 (bare integer), while every environment.yaml example uses spec_version: hf-rl-env-0.1 (string). Either use consistent values or add a comment explaining that openenv.yaml uses a different version scheme.

[T1-3] Template compliance: Key Design Decisions missing Trade-offs subsections

rfcs/README.md requires each design decision to have Chosen Approach / Rationale / Trade-offs sub-structure. The current "Key Design Decisions" section uses a flat bullet list. Please restructure or note that this RFC uses a condensed form intentionally.

[T1-4] Non-Goal 6 conflicts with the Verifiers adapter field in examples

Non-Goal #6: "No required remote code execution to load a task set." The Terminal-Bench example includes adapter: verifiers.v1.packages.tasksets.HarborTaskset, which names an importable class. OpenEnv ignores this field, but the example implies class-loading for the Verifiers framework. Add a clarifying comment noting that other frameworks' adapter fields are outside OpenEnv's execution scope.

Tier 2 — Alignment Flags

ALIGNMENT FLAG 1: Default row-cursor behavior and "agents cannot reset"

Principle at stake: Agents cannot reset (RFC 001, INVARIANTS.md)
The concern: The default behavior advances the dataset row cursor on every step() call. In environments where an episode spans many steps over a single task, this model presents a new task to the agent on each action. The RFC should clarify whether the cursor-advance-on-step default is intended for single-step-per-task environments only, and how multi-step episodes over one task are handled. More importantly, since step() is called by the training orchestrator on behalf of the agent, the RFC should explicitly state that cursor advancement is controlled by the orchestration layer and cannot be triggered by the agent directly.
Suggested reviewer: @Darktex

ALIGNMENT FLAG 2: Multi-episode cursor traversal vs. "one env = one trajectory"

Principle at stake: One env = one trajectory (RFC 004, PRINCIPLES.md)
The concern: The RFC does not state whether reset() resets the row cursor to 0. If a single dataset-bound environment instance is reused across multiple episodes (cursor advancing through the dataset), it produces many trajectories from one instance — in tension with the one-env-one-trajectory invariant. Please explicitly state the reset() cursor behavior and, if multi-episode reuse is intended, address how this squares with RFC 004.
Suggested reviewer: @Darktex

ALIGNMENT FLAG 3: Answer/verifier metadata leakage and "rewards inside environment"

Principle at stake: Rewards inside environment (RFC 002, INVARIANTS.md)
The concern: The canonical row schema includes "answer": "42" in the task dict. The RFC states the runtime must not leak this to the agent, but the enforcement mechanism is unspecified. If the task dict is passed wholesale to the runtime and forwarded to observations, ground truth is agent-visible. The RFC should specify which layer is responsible for answer stripping and whether the framework enforces this or leaves it to environment authors.
Suggested reviewer: @Darktex

ALIGNMENT FLAG 4: Unresolved secrets handling

Principle at stake: No credential exposure (INVARIANTS.md)
The concern: The secrets field in environment.yaml is left to Open Question #5, which is unresolved. The implementation plan (Phases 1–2) should specify how the resolver reads secrets (env vars only?), validates their presence before runtime launch, and ensures they are never included in error messages or logs. This cannot be deferred past implementation.
Suggested reviewer: @Darktex

ALIGNMENT FLAG 5: trust_remote_code=False gives false safety with package-backed runtimes

Principle at stake: Container isolation; Non-Goal 6 (no required remote code execution)
The concern: trust_remote_code=False is shown as a safe default, but for package-backed runtimes, Phase 2 says AutoEnv "installs or imports through the existing auto-discovery path" — which is remote code execution. The RFC must specify what trust_remote_code=False does for the package backend (block install? raise? warn?), so users are not given a false sense of security.
Suggested reviewer: @Darktex

Non-blocking Suggestions

Add a direct Markdown link to the RFC 001 section that covers "tasks as reset-time inputs" (currently a bare prose reference).
Consider adding a "Backward Compatibility" section documenting how existing openenv.yaml manifests without task_sets continue to work.
Open Question #1 (openenv-core[tasksets] extra) should be resolved before Phase 1 implementation to avoid an API decision made implicitly in code.

Note: This is an automated review by Claude Code, not a human review.

Automated review by Claude Code | Learn more

docs: add hf rl environment datasets rfc

2ef60dd

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 21, 2026

burtenshaw changed the title ~~[codex] add env dataset RFC~~ [RFC] add Environment dataset (taskset) RFC May 29, 2026

burtenshaw marked this pull request as ready for review May 29, 2026 10:17

Darktex suggested changes May 29, 2026

View reviewed changes

burtenshaw changed the title ~~[RFC] add Environment dataset (taskset) RFC~~ [RFC 007] add Environment dataset (taskset) RFC Jun 1, 2026

Darktex suggested changes Jun 1, 2026

View reviewed changes

Darktex suggested changes Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC 007] add Environment dataset (taskset) RFC#727

[RFC 007] add Environment dataset (taskset) RFC#727
burtenshaw wants to merge 1 commit into
huggingface:mainfrom
burtenshaw:codex/hf-rl-env-datasets

burtenshaw commented May 21, 2026

Uh oh!

Darktex left a comment

Uh oh!

Darktex left a comment

Uh oh!

Darktex left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

burtenshaw commented May 21, 2026

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Alignment Review — PR #727

Tier 1: Spec-Level Issues

Tier 2: Alignment Flags

Verdict

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Alignment Review Report

Automated Checks

Tier 1: Fixes Required

Tier 2: Alignment Discussion

Summary

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Alignment Review — RFC 006: HF RL Environment Datasets

Tier 1 — Mechanical Fixes

Tier 2 — Alignment Flags

Non-blocking Suggestions

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants