Hf space by Rhushya · Pull Request #612 · huggingface/OpenEnv

Rhushya · 2026-04-25T12:09:18Z

Summary

Type of Change

Alignment Checklist

Before submitting, verify:

I have read .claude/docs/PRINCIPLES.md and this PR aligns with our principles
I have checked .claude/docs/INVARIANTS.md and no invariants are violated
I have run /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

RFC Status

Not required (bug fix, docs, minor refactoring)
RFC exists: #___
RFC needed (will create before merge)

Test Plan

Claude Code Review

- 4 specialist agents (triage/escalation/compliance/responder)
- Schema drift engine with mid-episode policy mutations
- 5 independent reward functions for TRL GRPO
- Anti-reward-hacking: validation, timeout, repetition detection, reward capping
- Curriculum learning support (easy -> medium -> hard)
- 11 deterministic graders, all verifiable
- 4 difficulty tiers (easy/medium/hard/adversarial)
- Backward compatible: easy mode identical to Round 1
- Eval benchmark with 3 baseline agents
- Comprehensive test suite (8 tests passing)
- Pitch script and Q&A cheat sheet

…namically via XML

…ler lora, smaller seq length)

…ompatibility

Remove generated/tooling files, fix UI and training script correctness issues, move tests into the standard suite, and add a practical next-steps guide for Hugging Face demo deployment and Colab T4 GRPO training. Made-with: Cursor

Correct UI category/options handling and optional Gradio imports, harden GRPO cache concurrency and shebang execution, align server startup behavior, and ensure env tests run under the repository-standard tests path. Made-with: Cursor

Add a complete end-to-end presentation playbook, polished cyber-style demo UI copy, and a ready-to-upload Space template so training on Colab T4 and final deployment can be executed quickly for judging. Made-with: Cursor

Fix tokenizer fallback and optimizer fallback in training, update docs with Colab-safe command usage, and provide a final ready-to-run T4 notebook that avoids shell/Python cell mixing errors. Made-with: Cursor

Bypass package init side-effects in train_grpo imports, add fastmcp guidance for Colab installs, and publish a final T4 notebook for the Rhushya/OpenEnv repo with separated shell/Python execution cells. Made-with: Cursor

…l email prompts

…h to 256

…gths for Qwen2.5-1.5B

…for training

…for models.py

…ts, all cells tested

…d correct app.py and requirements

greptile-apps · 2026-04-25T12:15:19Z

Greptile Summary

This PR adds a complete email_triage_env environment — multi-turn email triage with specialist simulation, schema drift, SLA tracking, and composite deterministic rewards — along with a GRPO training script, a Gradio demo UI, and an HF Space deployment template.

Two P1 bugs need fixing before merge:

train_grpo.py uses Python's hash() as a reward cache key, which is non-deterministic across processes (PYTHONHASHSEED) and collision-prone, risking silent reward corruption during training.
The HF Space (space/app.py + DEPLOY.md) will fail at import time: the flat-copy deployment layout has no import path that resolves EmailTriageEnvironment or the base Action/Observation/State types for any of the copied modules.

Two Tier-2 alignment points need @darktex sign-off:

train_grpo.py imports server/ directly, creating a training/production lifecycle delta contrary to PRINCIPLES.md.
server/ui.py instantiates EmailTriageEnvironment and calls reset()/step() directly in UI handlers, bypassing the orchestration boundary in INVARIANTS.md.

Confidence Score: 2/5

Not safe to merge: two P1 bugs (corrupted reward cache, broken HF Space deployment) plus two unresolved alignment flags require human sign-off.

Two P1 findings — a silent reward-corruption risk in the training cache and a deployment-breaking import chain in the HF Space — cap the score at 4, and the unreviewed alignment flags (direct server/ import, UI bypassing orchestration boundary) pull it further down to 2.

envs/email_triage_env/train_grpo.py (cache key + reward scale + server/ import), space/app.py + DEPLOY.md (broken flat import chain), envs/email_triage_env/server/ui.py (direct env instantiation)

Important Files Changed

Filename	Overview
envs/email_triage_env/train_grpo.py	GRPO training script with two bugs: non-deterministic hash() cache key that can corrupt reward signals, and a reward scale mismatch where reward_format returns ±1.0 while all other reward functions return [0,1]. Also imports server/ directly, creating a lifecycle delta.
space/app.py	HF Space entry point that will fail at import time: ui.py (copied from server/) has no fallback import path that works in the flat Space file layout described in DEPLOY.md.
envs/email_triage_env/server/email_triage_environment.py	Well-structured multi-turn environment with backward-compatible easy mode, anti-reward-hacking, schema drift, and composite reward computation. Correctly keeps all reward logic server-side per PRINCIPLES.md.
envs/email_triage_env/server/graders.py	Pure deterministic graders with no side effects; all rewards computed inside the environment boundary, consistent with the rewards-inside-environment invariant.
envs/email_triage_env/server/ui.py	Gradio demo UI that directly instantiates EmailTriageEnvironment and calls reset()/step() — bypasses the two-interface boundary (alignment flag). Functionally fine for a human demo but architecturally misaligned.
space/env_types.py	Duplicate of core Action/Observation/State base classes; not imported by any module in the current codebase (models.py never tries from env_types import ...), making it a dead file prone to drift.
envs/email_triage_env/client.py	Clean EnvClient subclass; imports only from openenv.core and models.py, correctly respecting client-server separation.
envs/email_triage_env/models.py	Pydantic models for Action/Observation/State with deep import-fallback chains; types are correct and wire-safe, though the fallback chain does not include env_types.py, which would break the HF Space deployment.
tests/envs/test_email_triage_env.py	Functional smoke tests covering all difficulty tiers; print-heavy style rather than pure assertions, but covers the key behavioral contracts.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GRPO Training\ntrain_grpo.py] -->|direct import ⚠️| B[EmailTriageEnvironment\nserver/]
    A -->|calls reset + step| B
    B --> C[graders.py\nDeterministic rewards]
    B --> D[schema_drift.py\nPolicy mutations]
    B --> E[stakeholders.py\nSpecialist simulation]
    F[EnvClient\nclient.py] -->|HTTP /reset /step| G[server/app.py\nFastAPI]
    G --> B
    H[space/app.py\nHF Space ⚠️] -->|from ui import| I[ui.py\nGradio UI]
    I -->|direct instantiation ⚠️| B
    style A fill:#f9a,stroke:#c00
    style H fill:#f9a,stroke:#c00
    style I fill:#ffd,stroke:#aa0
    style F fill:#afa,stroke:#0a0

Prompt To Fix All With AI

This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 63-64

Comment:
**Non-deterministic `hash()` as cache key**

`hash()` is randomized per Python process via `PYTHONHASHSEED` (since Python 3.3). In multi-worker GRPO training each worker computes a different hash for the same `(prompt, completion)` input, so the per-process cache never shares hits. More critically, within a single process two distinct pairs can collide to the same hash value, causing `_score` to silently return a stale cached reward and corrupt the training signal. Replace with a stable, collision-resistant string key — for example the raw string `prompt_text[-100:] + "|" + completion_text[:200]` (memory trade-off) or a `hashlib`-based hex digest.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 84-85

Comment:
**Reward scale mismatch across reward functions**

`reward_format` returns `±1.0` while all other reward functions (`reward_quality`, `reward_sla`, `reward_policy`, `reward_oversight`) return values in `[0.0, 1.0]`. GRPO sums all five signals during advantage estimation, so `reward_format` effectively has twice the influence of any other signal — and its negativity (`-1.0`) far outweighs the maximum of any other reward. Consider normalising to the same `[0.0, 1.0]` range:

```python
hacking_penalty = 1.0 if format_ok else 0.0
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: space/app.py
Line: 6-7

Comment:
**HF Space deployment will fail at import time**

`space/app.py` imports `from ui import build_ui`, and `ui.py` (copied from `server/`) tries to import `EmailTriageEnvironment` via three fallback chains, all requiring either the `envs.email_triage_env.server.*` package, `email_triage_env.server.*`, or `server.email_triage_environment` module. In the flat HF Space (per `DEPLOY.md`), files are copied to the root, so none of these import paths resolve — there is no `server/` sub-package and no `from email_triage_environment import …` fallback. `email_triage_environment.py` itself would also fail since it has `from server.graders import …` etc. that similarly cannot resolve in a flat layout.

A `from email_triage_environment import EmailTriageEnvironment` (and matching flat imports inside `email_triage_environment.py`) needs to be added as a final fallback for the Space to be deployable.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 36-37

Comment:
**ALIGNMENT FLAG: Training script imports directly from `server/`**

`from server.email_triage_environment import EmailTriageEnvironment` bypasses the client-server boundary defined in `INVARIANTS.md` ("Clients must never import from `server/` directory") and violates the `PRINCIPLES.md` principle of minimising lifecycle deltas ("Training → Evals → Production should use identical interfaces"). The training script interacts with the environment via raw Python object calls rather than through `EnvClient`, which means the training path diverges from the production path.

- **Principle at stake**: Minimize lifecycle deltas (PRINCIPLES.md) + Client-server separation (INVARIANTS.md)
- **The concern**: Direct `server/` import in the training path creates a training/production divergence that the two-interface model was designed to prevent.
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/email_triage_env/server/ui.py
Line: 30-40

Comment:
**ALIGNMENT FLAG: Gradio UI exposes `reset()` and `step()` directly**

`do_reset` and `do_step` create an `EmailTriageEnvironment` and call `reset()`/`step()` on it inside the UI handler. `INVARIANTS.md` states "The WebSocket interface for reset/step is for orchestration only" and "Agents cannot access reset/simulation controls." While this is a human-facing demo (not an agent), the UI bypasses the intended two-interface boundary entirely — there is no client or WebSocket layer between the UI and the environment object.

- **Principle at stake**: Agent isolation / Dual API boundary (INVARIANTS.md)
- **The concern**: Any code running in the same process as the UI (or any future LLM-backed demo user) can directly call `env.reset()` without going through the orchestration boundary.
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: space/env_types.py
Line: 1-4

Comment:
**Duplicated core type definitions**

`space/env_types.py` re-implements `Action`, `Observation`, and `State` locally. These are the canonical base classes defined in `src/openenv/core/env_server/types.py`. Maintaining a separate copy risks diverging from the upstream definition. `models.py`'s import-fallback chain also never tries `from env_types import …`, so this file isn't wired into any import path — it appears to be a dead file in the current repo layout.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Merge branch 'main' into hf-space"

greptile-apps · 2026-04-25T12:15:23Z

+    cache_key = hash(prompt_text[-100:] + completion_text[:200])
+    with _CACHE_LOCK:


Non-deterministic hash() as cache key

hash() is randomized per Python process via PYTHONHASHSEED (since Python 3.3). In multi-worker GRPO training each worker computes a different hash for the same (prompt, completion) input, so the per-process cache never shares hits. More critically, within a single process two distinct pairs can collide to the same hash value, causing _score to silently return a stale cached reward and corrupt the training signal. Replace with a stable, collision-resistant string key — for example the raw string prompt_text[-100:] + "|" + completion_text[:200] (memory trade-off) or a hashlib-based hex digest.
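One way to act on the `hashlib` suggestion, as a minimal sketch. The function name `stable_cache_key` and the NUL separator are assumptions, not code from the PR; the 100- and 200-character windows are taken from the diff above.

```python
import hashlib


def stable_cache_key(prompt_text: str, completion_text: str) -> str:
    """Build a cache key that is identical in every Python process.

    Hypothetical helper: uses the same prompt/completion windows as the
    diff, joined by a NUL byte so the two parts cannot run into each other.
    """
    payload = prompt_text[-100:] + "\x00" + completion_text[:200]
    # SHA-256 is not affected by PYTHONHASHSEED, unlike the built-in hash().
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keying on truncated windows still maps any two pairs that share the same last 100 prompt characters and first 200 completion characters to one cache entry, whatever hash is used, so the windows may need widening if that matters for the reward.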

greptile-apps · 2026-04-25T12:15:24Z

+    format_ok = cat_m is not None and pri_m is not None and esc_m is not None
+    hacking_penalty = 1.0 if format_ok else -1.0


Reward scale mismatch across reward functions

reward_format returns ±1.0 while all other reward functions (reward_quality, reward_sla, reward_policy, reward_oversight) return values in [0.0, 1.0]. GRPO sums all five signals during advantage estimation, so reward_format effectively has twice the influence of any other signal — and its negativity (-1.0) far outweighs the maximum of any other reward. Consider normalising to the same [0.0, 1.0] range:

hacking_penalty = 1.0 if format_ok else 0.0
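
The arithmetic behind the mismatch can be sketched as follows. `rescale_to_unit` and `reward_span` are hypothetical helpers, not part of the PR; they only illustrate that a ±1.0 signal spans a range of 2.0 while a [0.0, 1.0] signal spans 1.0, and how a linear map brings the former onto the same scale.

```python
def rescale_to_unit(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Linearly map a value from [low, high] onto [0.0, 1.0]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (value - low) / (high - low)


def reward_span(low: float, high: float) -> float:
    """Width of a reward's range; a wider range can shift a summed reward further."""
    return high - low


# Hypothetical comparison of the two scales discussed above.
format_span = reward_span(-1.0, 1.0)  # range of the +/-1.0 format signal
other_span = reward_span(0.0, 1.0)    # range of a [0.0, 1.0] signal
```

Whether to rescale or to switch the failure value to 0.0 as suggested above is a design choice; both put the format signal on the same range as the other four.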

greptile-apps · 2026-04-25T12:15:25Z

+from ui import build_ui
+


HF Space deployment will fail at import time

space/app.py imports from ui import build_ui, and ui.py (copied from server/) tries to import EmailTriageEnvironment via three fallback chains, all requiring either the envs.email_triage_env.server.* package, email_triage_env.server.*, or server.email_triage_environment module. In the flat HF Space (per DEPLOY.md), files are copied to the root, so none of these import paths resolve — there is no server/ sub-package and no from email_triage_environment import … fallback. email_triage_environment.py itself would also fail since it has from server.graders import … etc. that similarly cannot resolve in a flat layout.

A from email_triage_environment import EmailTriageEnvironment (and matching flat imports inside email_triage_environment.py) needs to be added as a final fallback for the Space to be deployable.
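
A fallback chain of this kind could be sketched as below. `import_first` is a hypothetical helper, and the exact module paths in `ENV_MODULE_CANDIDATES` are assumptions based on the layouts named in this comment; only the last entry corresponds to the flat Space layout.

```python
import importlib
from typing import Any, Sequence


def import_first(module_names: Sequence[str], attr: str) -> Any:
    """Return `attr` from the first module in `module_names` that provides it."""
    errors = []
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
        errors.append(f"{name}: no attribute {attr!r}")
    raise ImportError("no candidate resolved: " + "; ".join(errors))


# Assumed candidate order: packaged layouts first, flat Space layout last.
ENV_MODULE_CANDIDATES = [
    "envs.email_triage_env.server.email_triage_environment",
    "email_triage_env.server.email_triage_environment",
    "server.email_triage_environment",
    "email_triage_environment",
]
```

In `ui.py` this would be used as `EmailTriageEnvironment = import_first(ENV_MODULE_CANDIDATES, "EmailTriageEnvironment")`. The same pattern would be needed for the `server.graders` style imports inside `email_triage_environment.py`.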

greptile-apps · 2026-04-25T12:15:26Z

+from server.email_triage_environment import EmailTriageEnvironment
+from models import EmailTriageAction


ALIGNMENT FLAG: Training script imports directly from server/

from server.email_triage_environment import EmailTriageEnvironment bypasses the client-server boundary defined in INVARIANTS.md ("Clients must never import from server/ directory") and violates the PRINCIPLES.md principle of minimising lifecycle deltas ("Training → Evals → Production should use identical interfaces"). The training script interacts with the environment via raw Python object calls rather than through EnvClient, which means the training path diverges from the production path.

Principle at stake: Minimize lifecycle deltas (PRINCIPLES.md) + Client-server separation (INVARIANTS.md)

The concern: Direct server/ import in the training path creates a training/production divergence that the two-interface model was designed to prevent.

Suggested reviewer: @darktex

Context Used: .claude/docs/INVARIANTS.md (source)

greptile-apps · 2026-04-25T12:15:27Z

+def do_reset(difficulty):
+    seed = random.randint(0, 9999)
+    env = EmailTriageEnvironment(difficulty=difficulty)
+    obs = env.reset(seed=seed, difficulty=difficulty)
+    info = obs.info or {}
+
+    ticket_md = _fmt_ticket(obs)
+    spec_md = _fmt_specialists(info)
+    stats_md = _fmt_stats(info)
+    status = f"✅ Queue started! {info.get('queue_size', '?')} tickets in {difficulty.upper()} mode. Seed: {seed}"
+    return env, obs, info, ticket_md, spec_md, stats_md, status, 0.0, ""


ALIGNMENT FLAG: Gradio UI exposes reset() and step() directly

do_reset and do_step create an EmailTriageEnvironment and call reset()/step() on it inside the UI handler. INVARIANTS.md states "The WebSocket interface for reset/step is for orchestration only" and "Agents cannot access reset/simulation controls." While this is a human-facing demo (not an agent), the UI bypasses the intended two-interface boundary entirely — there is no client or WebSocket layer between the UI and the environment object.

Principle at stake: Agent isolation / Dual API boundary (INVARIANTS.md)

The concern: Any code running in the same process as the UI (or any future LLM-backed demo user) can directly call env.reset() without going through the orchestration boundary.

Suggested reviewer: @darktex

Context Used: .claude/docs/INVARIANTS.md (source)

greptile-apps · 2026-04-25T12:15:28Z

+# Renamed from types.py to avoid shadowing Python stdlib 'types' module
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+
+from enum import Enum


Duplicated core type definitions

space/env_types.py re-implements Action, Observation, and State locally. These are the canonical base classes defined in src/openenv/core/env_server/types.py. Maintaining a separate copy risks diverging from the upstream definition. models.py's import-fallback chain also never tries from env_types import …, so this file isn't wired into any import path — it appears to be a dead file in the current repo layout.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds the “Oversight Inbox Arena” email-triage environment (multi-turn, multi-agent, schema drift), plus training/evaluation tooling and Hugging Face Space deployment scaffolding.

Changes:

Introduces a multi-ticket EmailTriageEnvironment with deterministic graders, specialist simulations, and schema drift.
Adds GRPO training, evaluation scripts, notebooks, and smoke/E2E tests (HTTP + env).
Adds Hugging Face Space templates and a Gradio UI for interactive demos.

Reviewed changes

Copilot reviewed 36 out of 38 changed files in this pull request and generated 8 comments.

File	Description
tests/envs/test_email_triage_http.py	Adds an end-to-end HTTP server smoke test.
tests/envs/test_email_triage_env.py	Adds environment-level smoke tests across difficulty tiers + determinism.
space/requirements.txt	Space dependency list.
space/env_types.py	Pydantic types used by the Space bundle (renamed to avoid stdlib shadowing).
space/app.py	Space entrypoint to launch the Gradio UI.
space/README.md	Space metadata card and short description.
space/DEPLOY.md	Manual deployment/copy instructions for Hugging Face Spaces.
pre_training.json	Stores baseline/benchmark metrics snapshot.
envs/email_triage_env/train_grpo.py	Provides a GRPO training script with multiple reward functions.
envs/email_triage_env/server/ui.py	Implements the Gradio UI for the arena.
envs/email_triage_env/server/stakeholders.py	Implements simulated specialist agents.
envs/email_triage_env/server/schema_drift.py	Implements mid-episode policy/schema drift engine.
envs/email_triage_env/server/scenario_generator.py	Deterministically generates multi-ticket scenarios by seed.
envs/email_triage_env/server/graders.py	Implements deterministic reward graders, including multi-turn components.
envs/email_triage_env/server/email_triage_environment.py	Core environment logic (multi-turn queue, drift, specialists, anti-hack).
envs/email_triage_env/server/email_triage_dataset.json	Adds labeled email dataset used by the environment.
envs/email_triage_env/server/app.py	FastAPI app wiring via `create_app`.
envs/email_triage_env/server/Dockerfile	Container build/run entry for the environment server.
envs/email_triage_env/pyproject.toml	Packages the environment for install/use.
envs/email_triage_env/openenv.yaml	Declares the OpenEnv manifest for the environment.
envs/email_triage_env/models.py	Defines Action/Observation/State models with fallbacks.
envs/email_triage_env/inference.py	Adds an inference runner that spins up the env container and queries an LLM.
envs/email_triage_env/hf_space_template/requirements.txt	Template Space requirements for quick deployment.
envs/email_triage_env/hf_space_template/app.py	Template Space app entrypoint.
envs/email_triage_env/hf_space_template/README.md	Template Space instructions.
envs/email_triage_env/eval_benchmark.py	Adds a deterministic evaluation harness for baseline agents.
envs/email_triage_env/colab_t4_training.ipynb	Colab training notebook for T4.
envs/email_triage_env/client.py	Adds an OpenEnv client wrapper for the environment.
envs/email_triage_env/__init__.py	Exposes public API for the env package.
envs/email_triage_env/Rhushya_OpenEnv_EmailTriage_Training.ipynb	Additional training notebook/materials.
envs/email_triage_env/README_NEXT_STEPS.md	Demo/training/deployment runbook.
envs/email_triage_env/README.md	Full environment documentation and rationale.
envs/email_triage_env/FINAL_SHOWCASE_README.md	Final presentation checklist/hand-off.
envs/email_triage_env/1.ipynb	Additional Colab notebook copy.
envs/email_triage_env/.env.example	Example environment variables for inference.
EmailTriage_GRPO_Train.ipynb	Root-level GRPO notebook.
.gitignore	Ignores local artifacts generated by tooling and evaluation outputs.

Comments suppressed due to low confidence (4)

tests/envs/test_email_triage_http.py:1

This test is prone to flakiness: it binds a fixed port (8099) which can conflict in parallel test runs/CI, and it uses a fixed sleep(3) instead of waiting for readiness. Also, server shutdown won’t run if an earlier assertion fails (no try/finally). Recommended: allocate an ephemeral free port, poll /health until ready (with a deadline), and ensure server.should_exit + join() happens in a finally block so the thread is always stopped.
tests/envs/test_email_triage_http.py:1
The comment says the first hard step “should NOT be done”, but the test never asserts that invariant (it only prints done_val). Add an assertion that data.get("done") is False here so the test actually validates the multi-turn behavior it describes.
space/DEPLOY.md:1
The deployment instructions reference copying types.py, but this PR introduces space/env_types.py (explicitly renamed to avoid shadowing stdlib types). As written, following these steps will likely copy the wrong filename and break imports. Update the instructions to match the actual file name/layout used by the Space bundle.
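
The flakiness fixes recommended for tests/envs/test_email_triage_http.py (ephemeral port, readiness polling with a deadline, shutdown in a finally block) could look roughly like this. It is a sketch using a stdlib stand-in server rather than the PR's actual server; `find_free_port`, `wait_until_ready`, and `run_smoke_check` are hypothetical names.

```python
import socket
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def wait_until_ready(url: str, timeout_s: float = 10.0, interval_s: float = 0.1) -> None:
    """Poll `url` until it answers with HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1.0) as response:
                if response.status == 200:
                    return
        except OSError:
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    raise TimeoutError(f"{url} not ready after {timeout_s}s")


class _HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for the real app: answers 200 on /health, 404 elsewhere."""

    def do_GET(self):
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep test output quiet


def run_smoke_check() -> int:
    """Start the stand-in server, wait for readiness, and always stop it."""
    port = find_free_port()
    server = HTTPServer(("127.0.0.1", port), _HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        wait_until_ready(f"http://127.0.0.1:{port}/health")
        return port
    finally:
        # Runs even if an assertion above fails, so the thread never leaks.
        server.shutdown()
        thread.join(timeout=5)
        server.server_close()
```

In the real test the `finally` block would set `server.should_exit` and join the server thread, as the comment describes. There is a small window between `find_free_port` returning and the server binding in which another process could take the port; that is usually acceptable for tests.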

Copilot · 2026-04-25T12:15:30Z

+        )
+
+        # ── 7. Determine done ────────────────────────────────────────
+        done = self._queue_index >= len(self._queue)


queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).

Suggested change

-        done = self._queue_index >= len(self._queue)
+        done = self._queue_index >= len(self._queue)
+        queue_position = min(self._queue_index + 1, len(self._queue)) if self._queue else 0
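
The effect of the clamp can be checked in isolation. This is a standalone sketch of the same expression; the free function `queue_position` is hypothetical and takes the index and size as plain integers instead of reading them from the environment.

```python
def queue_position(queue_index: int, queue_size: int) -> int:
    """1-based position of the current ticket, clamped to the queue size.

    Returns 0 for an empty queue, mirroring the suggested change.
    """
    if queue_size == 0:
        return 0
    return min(queue_index + 1, queue_size)


# After the terminal step of a 3-ticket queue the index has advanced to 3;
# without the clamp the reported position would be 4.
terminal_position = queue_position(3, 3)
```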

Copilot · 2026-04-25T12:15:30Z

+            "queue_size": len(self._queue),
+            "queue_position": self._queue_index + 1,


queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).

Copilot · 2026-04-25T12:15:30Z

+                "queue_position": self._queue_index + 1,
+                "queue_size": len(self._queue),


queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).

Copilot · 2026-04-25T12:15:31Z

+        deadline = (
+            self._sla_deadlines[self._queue_index]
+            if self._queue_index < len(self._sla_deadlines)
+            else self._state.step_count
+        )
+        if self._state.step_count > deadline:


As implemented, each step() always resolves exactly one ticket and advances the queue, so step_count grows as i+1 while the SLA deadlines are (i+1) * sla_steps_per_ticket. That means step_count > deadline will never be true for any sla_steps_per_ticket >= 1, so SLA breaches never occur and the SLA reward signal becomes effectively constant. To make SLA meaningful, either (a) allow multiple steps per ticket (don’t advance the queue on incorrect/unsafe actions), or (b) redefine deadlines relative to global episode budget or per-ticket elapsed steps in a way that can actually be exceeded.

Suggested change

-        deadline = (
-            self._sla_deadlines[self._queue_index]
-            if self._queue_index < len(self._sla_deadlines)
-            else self._state.step_count
-        )
-        if self._state.step_count > deadline:
+        # `_sla_deadlines` may be stored as cumulative queue deadlines
+        # (for example, `(i + 1) * sla_steps_per_ticket`). Because each
+        # step resolves exactly one ticket, comparing the current global
+        # `step_count` against that cumulative deadline can make breaches
+        # impossible for any SLA budget >= 1. Derive the current ticket's
+        # effective per-ticket SLA budget from the cumulative sequence so
+        # that the elapsed episode steps can meaningfully exceed it.
+        if self._queue_index < len(self._sla_deadlines):
+            cumulative_deadline = self._sla_deadlines[self._queue_index]
+            previous_cumulative_deadline = (
+                self._sla_deadlines[self._queue_index - 1]
+                if self._queue_index > 0
+                else 0
+            )
+            ticket_sla_budget = cumulative_deadline - previous_cumulative_deadline
+        else:
+            ticket_sla_budget = self._state.step_count
+        if self._state.step_count > ticket_sla_budget:

Copilot · 2026-04-25T12:15:31Z

+        # ── 6. Advance queue ─────────────────────────────────────────
+        self._queue_index += 1
+        self._state.tickets_resolved += 1


As implemented, each step() always resolves exactly one ticket and advances the queue, so step_count grows as i+1 while the SLA deadlines are (i+1) * sla_steps_per_ticket. That means step_count > deadline will never be true for any sla_steps_per_ticket >= 1, so SLA breaches never occur and the SLA reward signal becomes effectively constant. To make SLA meaningful, either (a) allow multiple steps per ticket (don’t advance the queue on incorrect/unsafe actions), or (b) redefine deadlines relative to global episode budget or per-ticket elapsed steps in a way that can actually be exceeded.
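The arithmetic behind this comment can be checked in a few lines. The sketch below is an illustration only: `breach_steps` and `num_tickets` are hypothetical names, and it assumes, as the comment describes, that the deadline for ticket `i` is `(i + 1) * sla_steps_per_ticket` and that `step_count` equals `i + 1` when that ticket is checked.

```python
def breach_steps(num_tickets: int, sla_steps_per_ticket: int) -> list[int]:
    """Return the step counts at which `step_count > deadline` would fire.

    Assumes exactly one step per ticket, so ticket i is handled at step i + 1.
    """
    deadlines = [(i + 1) * sla_steps_per_ticket for i in range(num_tickets)]
    breaches = []
    for queue_index, deadline in enumerate(deadlines):
        step_count = queue_index + 1  # one step per ticket
        if step_count > deadline:
            breaches.append(step_count)
    return breaches


# With one step per ticket, i + 1 can never exceed (i + 1) * k for k >= 1.
for k in (1, 2, 5):
    print(k, breach_steps(num_tickets=10, sla_steps_per_ticket=k))
```

Under these assumptions every list is empty, which is why the comment proposes letting a ticket consume more than one step or redefining the deadlines.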

Copilot · 2026-04-25T12:15:31Z

+ "nbformat": 4,
+ "nbformat_minor": 5
+}
+{
+ "cells": [


This .ipynb file contains two top-level JSON notebook objects concatenated back-to-back (a second { ... } begins at line 139). That makes the notebook invalid JSON and it won’t open in Jupyter/Colab. Remove the duplicate notebook content (or split into separate files) so the file contains exactly one valid notebook JSON object.
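One way to catch this kind of problem before committing is to count the top-level JSON values in the file. The sketch below uses only the standard library; the function name `count_top_level_json_values` is hypothetical, and the notebook path in the usage note is a placeholder rather than a file from this PR.

```python
import json


def count_top_level_json_values(text: str) -> int:
    """Count consecutive top-level JSON values in `text`.

    A valid notebook file should contain exactly one. Raises
    json.JSONDecodeError if any value is malformed.
    """
    decoder = json.JSONDecoder()
    pos, count = 0, 0
    while True:
        # raw_decode does not skip leading whitespace, so do it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return count
        _, pos = decoder.raw_decode(text, pos)
        count += 1
```

For example, reading a notebook with `open("notebook.ipynb", encoding="utf-8").read()` and passing the text to this function would return 2 for a file in the state described above.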

Copilot · 2026-04-25T12:15:31Z

+| Pydantic serialization | All wire types are Pydantic models |
+| Rewards inside environment boundary | All graders compute inside `step()` |
+| Client-server separation | `client.py` never imports from `server/` |
+| `SUPPORTS_CONCURRENT_SESSIONS = True` | Stateless across sessions |


The README states SUPPORTS_CONCURRENT_SESSIONS = True, but EmailTriageEnvironment sets SUPPORTS_CONCURRENT_SESSIONS = False in code. Please update the README to reflect the actual constant (or update the constant if the README is correct), since this affects how users deploy/run the server.

Suggested change

-| `SUPPORTS_CONCURRENT_SESSIONS = True` | Stateless across sessions |
+| `SUPPORTS_CONCURRENT_SESSIONS = False` | Concurrent sessions are not supported by the current environment implementation |

Copilot · 2026-04-25T12:15:31Z

+    # TRL ≥ v1.0 pulls in mergekit at import time; auto-install if missing.
+    import subprocess as _sp
+    for _pkg in ("mergekit", "fastmcp"):
+        try:
+            __import__(_pkg)
+        except ImportError:
+            print(f"[DEP] Installing missing dependency: {_pkg}")
+            _sp.check_call([sys.executable, "-m", "pip", "install", "-q", _pkg])


Installing packages at runtime from within the training script introduces reliability and security issues (non-reproducible environments, network failures mid-run, unexpected dependency resolution). Prefer moving these dependencies into an explicit requirements/lockfile or documenting a pip install ... command for Colab; if you must keep this behavior, gate it behind an explicit flag (e.g., --auto-install-deps) so “normal” runs don’t mutate the environment.
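A sketch of the opt-in behaviour suggested here, assuming the `--auto-install-deps` flag name from the comment. The function names `build_parser` and `ensure_dependencies` and the `installer` parameter are hypothetical and exist only to make the example testable; they are not taken from the training script.

```python
import argparse
import importlib.util
import subprocess
import sys

REQUIRED_PACKAGES = ("mergekit", "fastmcp")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    # Off by default, so a normal run never mutates the environment.
    parser.add_argument("--auto-install-deps", action="store_true")
    return parser


def ensure_dependencies(names, auto_install, installer=subprocess.check_call):
    """Return the packages that were missing; install them only when asked to."""
    missing = [n for n in names if importlib.util.find_spec(n) is None]
    if not missing:
        return []
    if not auto_install:
        raise SystemExit(
            "Missing dependencies: " + ", ".join(missing)
            + ". Install them with pip or rerun with --auto-install-deps."
        )
    print(f"[DEP] Installing missing dependencies: {missing}")
    installer([sys.executable, "-m", "pip", "install", "-q", *missing])
    return missing
```

A script would call `ensure_dependencies(REQUIRED_PACKAGES, build_parser().parse_args().auto_install_deps)` before importing TRL. Pinning the same packages in a requirements file remains the more reproducible option.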

Rhushya and others added 30 commits April 3, 2026 07:12

Add EmailTriageEnv: workflow RL environment for email triage

7a822a5

Harden EmailTriageEnv for Round 1 requirements

2393391

Add env var template for inference configuration

3345f53

Merge add-email-triage-env into main

e4bf65c

Align inference env vars with checklist requirements

3bc9042

Fix email triage app root route and create_app fallback

e194f25

Add uv.lock for validator multi-mode readiness

be7ac1c

Support Groq-compatible API keys in inference

0cba156

fix(inference): harden docker startup and prevent phase-2 crash

589bf1e

Fix phase 2 inference crash and submission hardening

5779347

Harden docker builds for phase 2 evaluator

3a97ff4

Make phase2 Docker base explicit for evaluator

e9d1931

Improve inference image tag and dockerfile fallbacks

ce429da

Fix reset imports across openenv package layouts

8eea401

feat: Add interactive Gradio UI for Hugging Face Space

9f99a9f

fix: remove root route to allow Gradio UI to mount

23fa3a2

fix: unpack error in gradio callbacks

88617af

Update environment and UI

2e55e22

fix: dynamically check bf16 support for older GPUs like Colab T4

ed3fe64

fix: remove environment_factory from GRPOTrainer and parse rewards dynamically via XML

a9a1064

fix: handle list-type completions in reward functions

fc3f618

fix: load model in 4-bit via Unsloth to prevent Colab T4 OOM

ae69eea

fix: extreme VRAM optimizations for Colab T4 (8-bit paged adamw, smaller lora, smaller seq length)

bde6acd

fix: ultra low VRAM profile to guarantee Colab T4 compatibility

c7086e4

fix: switch to Qwen2-0.5B, complete rewrite for guaranteed Colab T4 compatibility

3d66f68

feat: premium Gradio UI on HF Space + push-to-hub training support

246e64f

fix(space): pin gradio below v6 for ui compatibility

5451ff5

fix(ui): remove unsupported markdown scale args for gradio

1bb736d

Rhushya and others added 17 commits April 25, 2026 13:40

Finalize showcase docs and Hugging Face Space template.

235b1c3

Add a complete end-to-end presentation playbook, polished cyber-style demo UI copy, and a ready-to-upload Space template so training on Colab T4 and final deployment can be executed quickly for judging. Made-with: Cursor

Stabilize Colab T4 GRPO workflow and notebook.

bdc111b

Fix tokenizer fallback and optimizer fallback in training, update docs with Colab-safe command usage, and provide a final ready-to-run T4 notebook that avoids shell/Python cell mixing errors. Made-with: Cursor

Fix Colab T4 import failures and finalize training notebook.

3d90132

Bypass package init side-effects in train_grpo imports, add fastmcp guidance for Colab installs, and publish a final T4 notebook for the Rhushya/OpenEnv repo with separated shell/Python execution cells. Made-with: Cursor

fix: remove max_prompt_length from GRPOConfig (dropped in TRL v1.0+)

e54190b

chore: add Colab training notebook for email triage GRPO

f0810a5

fix: unblock GRPO learning — longer completions, reward variance, real email prompts

7c6c4cd

feat: switch default model to Qwen2.5-1.5B, bump max_completion_length to 256

a3dd639

fix: remove JSON BOM, add temperature=0.9, bump completion/prompt lengths for Qwen2.5-1.5B

bd2e21d

chore: update training notebooks — latest Colab version with all fixes

11d1589

fix: robust Environment import fallback — no longer requires fastmcp for training

37a34e2

fix: auto-install mergekit and fastmcp before TRL import

f1b0b3a

fix: proper reward extraction in easy mode + robust import fallbacks for models.py

3193142

chore: clean production notebook — pinned deps, no smoke-test conflicts, all cells tested

9b767d4

Add simple GRPO training notebook (no vLLM, Unsloth 4-bit, T4-safe)

14402fb

Create HF Space deployment — all files ready to copy to HuggingFace

6044a6b

Fix: rename types.py -> env_types.py to avoid Python stdlib clash; add correct app.py and requirements

cf408a8

Copilot AI review requested due to automatic review settings April 25, 2026 12:09

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 25, 2026

Merge branch 'main' into hf-space

8313490

Copilot started reviewing on behalf of Rhushya April 25, 2026 12:09 View session

greptile-apps Bot reviewed Apr 25, 2026


Copilot AI reviewed Apr 25, 2026


		cache_key = hash(prompt_text[-100:] + completion_text[:200])
		with _CACHE_LOCK:

		format_ok = cat_m is not None and pri_m is not None and esc_m is not None
		hacking_penalty = 1.0 if format_ok else -1.0

		from server.email_triage_environment import EmailTriageEnvironment
		from models import EmailTriageAction


Conversation

Rhushya commented Apr 25, 2026


greptile-apps Bot commented Apr 25, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Flowchart


Copilot AI left a comment


Pull request overview

Reviewed changes



2 participants