Skip to content

Hf space#612

Open
Rhushya wants to merge 48 commits into
huggingface:mainfrom
Rhushya:hf-space
Open

Hf space#612
Rhushya wants to merge 48 commits into
huggingface:mainfrom
Rhushya:hf-space

Conversation

@Rhushya
Copy link
Copy Markdown

@Rhushya Rhushya commented Apr 25, 2026

Summary

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • New environment
  • Refactoring

Alignment Checklist

Before submitting, verify:

  • I have read .claude/docs/PRINCIPLES.md and this PR aligns with our principles
  • I have checked .claude/docs/INVARIANTS.md and no invariants are violated
  • I have run /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

RFC Status

  • Not required (bug fix, docs, minor refactoring)
  • RFC exists: #___
  • RFC needed (will create before merge)

Test Plan

Claude Code Review

Rhushya and others added 30 commits April 3, 2026 07:12
- 4 specialist agents (triage/escalation/compliance/responder)
- Schema drift engine with mid-episode policy mutations
- 5 independent reward functions for TRL GRPO
- Anti-reward-hacking: validation, timeout, repetition detection, reward capping
- Curriculum learning support (easy -> medium -> hard)
- 11 deterministic graders, all verifiable
- 4 difficulty tiers (easy/medium/hard/adversarial)
- Backward compatible: easy mode identical to Round 1
- Eval benchmark with 3 baseline agents
- Comprehensive test suite (8 tests passing)
- Pitch script and Q&A cheat sheet
Remove generated/tooling files, fix UI and training script correctness issues, move tests into the standard suite, and add a practical next-steps guide for Hugging Face demo deployment and Colab T4 GRPO training.

Made-with: Cursor
Rhushya and others added 17 commits April 25, 2026 13:40
Correct UI category/options handling and optional Gradio imports, harden GRPO cache concurrency and shebang execution, align server startup behavior, and ensure env tests run under the repository-standard tests path.

Made-with: Cursor
Add a complete end-to-end presentation playbook, polished cyber-style demo UI copy, and a ready-to-upload Space template so training on Colab T4 and final deployment can be executed quickly for judging.

Made-with: Cursor
Fix tokenizer fallback and optimizer fallback in training, update docs with Colab-safe command usage, and provide a final ready-to-run T4 notebook that avoids shell/Python cell mixing errors.

Made-with: Cursor
Bypass package init side-effects in train_grpo imports, add fastmcp guidance for Colab installs, and publish a final T4 notebook for the Rhushya/OpenEnv repo with separated shell/Python execution cells.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 25, 2026 12:09
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 25, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 25, 2026

Greptile Summary

This PR adds a complete email_triage_env environment — multi-turn email triage with specialist simulation, schema drift, SLA tracking, and composite deterministic rewards — along with a GRPO training script, a Gradio demo UI, and an HF Space deployment template.

Two P1 bugs need fixing before merge:

  • train_grpo.py uses Python's hash() as a reward cache key, which is non-deterministic across processes (PYTHONHASHSEED) and collision-prone, risking silent reward corruption during training.
  • The HF Space (space/app.py + DEPLOY.md) will fail at import time: the flat-copy deployment layout has no import path that resolves EmailTriageEnvironment or the base Action/Observation/State types for any of the copied modules.

Two Tier-2 alignment points need @darktex sign-off:

  • train_grpo.py imports server/ directly, creating a training/production lifecycle delta contrary to PRINCIPLES.md.
  • server/ui.py instantiates EmailTriageEnvironment and calls reset()/step() directly in UI handlers, bypassing the orchestration boundary in INVARIANTS.md.

Confidence Score: 2/5

Not safe to merge: two P1 bugs (corrupted reward cache, broken HF Space deployment) plus two unresolved alignment flags require human sign-off.

Two P1 findings — a silent reward-corruption risk in the training cache and a deployment-breaking import chain in the HF Space — cap the score at 4, and the unreviewed alignment flags (direct server/ import, UI bypassing orchestration boundary) pull it further down to 2.

envs/email_triage_env/train_grpo.py (cache key + reward scale + server/ import), space/app.py + DEPLOY.md (broken flat import chain), envs/email_triage_env/server/ui.py (direct env instantiation)

Important Files Changed

Filename Overview
envs/email_triage_env/train_grpo.py GRPO training script with two bugs: non-deterministic hash() cache key that can corrupt reward signals, and a reward scale mismatch where reward_format returns ±1.0 while all other reward functions return [0,1]. Also imports server/ directly, creating a lifecycle delta.
space/app.py HF Space entry point that will fail at import time: ui.py (copied from server/) has no fallback import path that works in the flat Space file layout described in DEPLOY.md.
envs/email_triage_env/server/email_triage_environment.py Well-structured multi-turn environment with backward-compatible easy mode, anti-reward-hacking, schema drift, and composite reward computation. Correctly keeps all reward logic server-side per PRINCIPLES.md.
envs/email_triage_env/server/graders.py Pure deterministic graders with no side effects; all rewards computed inside the environment boundary, consistent with the rewards-inside-environment invariant.
envs/email_triage_env/server/ui.py Gradio demo UI that directly instantiates EmailTriageEnvironment and calls reset()/step() — bypasses the two-interface boundary (alignment flag). Functionally fine for a human demo but architecturally misaligned.
space/env_types.py Duplicate of core Action/Observation/State base classes; not imported by any module in the current codebase (models.py never tries from env_types import ...), making it a dead file prone to drift.
envs/email_triage_env/client.py Clean EnvClient subclass; imports only from openenv.core and models.py, correctly respecting client-server separation.
envs/email_triage_env/models.py Pydantic models for Action/Observation/State with deep import-fallback chains; types are correct and wire-safe, though the fallback chain does not include env_types.py, which would break the HF Space deployment.
tests/envs/test_email_triage_env.py Functional smoke tests covering all difficulty tiers; print-heavy style rather than pure assertions, but covers the key behavioral contracts.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GRPO Training\ntrain_grpo.py] -->|direct import ⚠️| B[EmailTriageEnvironment\nserver/]
    A -->|calls reset + step| B
    B --> C[graders.py\nDeterministic rewards]
    B --> D[schema_drift.py\nPolicy mutations]
    B --> E[stakeholders.py\nSpecialist simulation]
    F[EnvClient\nclient.py] -->|HTTP /reset /step| G[server/app.py\nFastAPI]
    G --> B
    H[space/app.py\nHF Space ⚠️] -->|from ui import| I[ui.py\nGradio UI]
    I -->|direct instantiation ⚠️| B
    style A fill:#f9a,stroke:#c00
    style H fill:#f9a,stroke:#c00
    style I fill:#ffd,stroke:#aa0
    style F fill:#afa,stroke:#0a0
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 63-64

Comment:
**Non-deterministic `hash()` as cache key**

`hash()` is randomized per Python process via `PYTHONHASHSEED` (since Python 3.3). In multi-worker GRPO training each worker computes a different hash for the same `(prompt, completion)` input, so the per-process cache never shares hits. More critically, within a single process two distinct pairs can collide to the same hash value, causing `_score` to silently return a stale cached reward and corrupt the training signal. Replace with a stable, collision-resistant string key — for example the raw string `prompt_text[-100:] + "|" + completion_text[:200]` (memory trade-off) or a `hashlib`-based hex digest.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 84-85

Comment:
**Reward scale mismatch across reward functions**

`reward_format` returns `±1.0` while all other reward functions (`reward_quality`, `reward_sla`, `reward_policy`, `reward_oversight`) return values in `[0.0, 1.0]`. GRPO sums all five signals during advantage estimation, so `reward_format` effectively has twice the influence of any other signal — and its negativity (`-1.0`) far outweighs the maximum of any other reward. Consider normalising to the same `[0.0, 1.0]` range:

```python
hacking_penalty = 1.0 if format_ok else 0.0
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: space/app.py
Line: 6-7

Comment:
**HF Space deployment will fail at import time**

`space/app.py` imports `from ui import build_ui`, and `ui.py` (copied from `server/`) tries to import `EmailTriageEnvironment` via three fallback chains, all requiring either the `envs.email_triage_env.server.*` package, `email_triage_env.server.*`, or `server.email_triage_environment` module. In the flat HF Space (per `DEPLOY.md`), files are copied to the root, so none of these import paths resolve — there is no `server/` sub-package and no `from email_triage_environment import …` fallback. `email_triage_environment.py` itself would also fail since it has `from server.graders import …` etc. that similarly cannot resolve in a flat layout.

A `from email_triage_environment import EmailTriageEnvironment` (and matching flat imports inside `email_triage_environment.py`) needs to be added as a final fallback for the Space to be deployable.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 36-37

Comment:
**ALIGNMENT FLAG: Training script imports directly from `server/`**

`from server.email_triage_environment import EmailTriageEnvironment` bypasses the client-server boundary defined in `INVARIANTS.md` ("Clients must never import from `server/` directory") and violates the `PRINCIPLES.md` principle of minimising lifecycle deltas ("Training → Evals → Production should use identical interfaces"). The training script interacts with the environment via raw Python object calls rather than through `EnvClient`, which means the training path diverges from the production path.

- **Principle at stake**: Minimize lifecycle deltas (PRINCIPLES.md) + Client-server separation (INVARIANTS.md)
- **The concern**: Direct `server/` import in the training path creates a training/production divergence that the two-interface model was designed to prevent.
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/email_triage_env/server/ui.py
Line: 30-40

Comment:
**ALIGNMENT FLAG: Gradio UI exposes `reset()` and `step()` directly**

`do_reset` and `do_step` create an `EmailTriageEnvironment` and call `reset()`/`step()` on it inside the UI handler. `INVARIANTS.md` states "The WebSocket interface for reset/step is for orchestration only" and "Agents cannot access reset/simulation controls." While this is a human-facing demo (not an agent), the UI bypasses the intended two-interface boundary entirely — there is no client or WebSocket layer between the UI and the environment object.

- **Principle at stake**: Agent isolation / Dual API boundary (INVARIANTS.md)
- **The concern**: Any code running in the same process as the UI (or any future LLM-backed demo user) can directly call `env.reset()` without going through the orchestration boundary.
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: space/env_types.py
Line: 1-4

Comment:
**Duplicated core type definitions**

`space/env_types.py` re-implements `Action`, `Observation`, and `State` locally. These are the canonical base classes defined in `src/openenv/core/env_server/types.py`. Maintaining a separate copy risks diverging from the upstream definition. `models.py`'s import-fallback chain also never tries `from env_types import …`, so this file isn't wired into any import path — it appears to be a dead file in the current repo layout.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Merge branch 'main' into hf-space" | Re-trigger Greptile

Comment on lines +63 to +64
cache_key = hash(prompt_text[-100:] + completion_text[:200])
with _CACHE_LOCK:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Non-deterministic hash() as cache key

hash() is randomized per Python process via PYTHONHASHSEED (since Python 3.3). In multi-worker GRPO training each worker computes a different hash for the same (prompt, completion) input, so the per-process cache never shares hits. More critically, within a single process two distinct pairs can collide to the same hash value, causing _score to silently return a stale cached reward and corrupt the training signal. Replace with a stable, collision-resistant string key — for example the raw string prompt_text[-100:] + "|" + completion_text[:200] (memory trade-off) or a hashlib-based hex digest.

Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 63-64

Comment:
**Non-deterministic `hash()` as cache key**

`hash()` is randomized per Python process via `PYTHONHASHSEED` (since Python 3.3). In multi-worker GRPO training each worker computes a different hash for the same `(prompt, completion)` input, so the per-process cache never shares hits. More critically, within a single process two distinct pairs can collide to the same hash value, causing `_score` to silently return a stale cached reward and corrupt the training signal. Replace with a stable, collision-resistant string key — for example the raw string `prompt_text[-100:] + "|" + completion_text[:200]` (memory trade-off) or a `hashlib`-based hex digest.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +84 to +85
format_ok = cat_m is not None and pri_m is not None and esc_m is not None
hacking_penalty = 1.0 if format_ok else -1.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Reward scale mismatch across reward functions

reward_format returns ±1.0 while all other reward functions (reward_quality, reward_sla, reward_policy, reward_oversight) return values in [0.0, 1.0]. GRPO sums all five signals during advantage estimation, so reward_format effectively has twice the influence of any other signal — and its negativity (-1.0) far outweighs the maximum of any other reward. Consider normalising to the same [0.0, 1.0] range:

hacking_penalty = 1.0 if format_ok else 0.0
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 84-85

Comment:
**Reward scale mismatch across reward functions**

`reward_format` returns `±1.0` while all other reward functions (`reward_quality`, `reward_sla`, `reward_policy`, `reward_oversight`) return values in `[0.0, 1.0]`. GRPO sums all five signals during advantage estimation, so `reward_format` effectively has twice the influence of any other signal — and its negativity (`-1.0`) far outweighs the maximum of any other reward. Consider normalising to the same `[0.0, 1.0]` range:

```python
hacking_penalty = 1.0 if format_ok else 0.0
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread space/app.py
Comment on lines +6 to +7
from ui import build_ui

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 HF Space deployment will fail at import time

space/app.py imports from ui import build_ui, and ui.py (copied from server/) tries to import EmailTriageEnvironment via three fallback chains, all requiring either the envs.email_triage_env.server.* package, email_triage_env.server.*, or server.email_triage_environment module. In the flat HF Space (per DEPLOY.md), files are copied to the root, so none of these import paths resolve — there is no server/ sub-package and no from email_triage_environment import … fallback. email_triage_environment.py itself would also fail since it has from server.graders import … etc. that similarly cannot resolve in a flat layout.

A from email_triage_environment import EmailTriageEnvironment (and matching flat imports inside email_triage_environment.py) needs to be added as a final fallback for the Space to be deployable.

Prompt To Fix With AI
This is a comment left during a code review.
Path: space/app.py
Line: 6-7

Comment:
**HF Space deployment will fail at import time**

`space/app.py` imports `from ui import build_ui`, and `ui.py` (copied from `server/`) tries to import `EmailTriageEnvironment` via three fallback chains, all requiring either the `envs.email_triage_env.server.*` package, `email_triage_env.server.*`, or `server.email_triage_environment` module. In the flat HF Space (per `DEPLOY.md`), files are copied to the root, so none of these import paths resolve — there is no `server/` sub-package and no `from email_triage_environment import …` fallback. `email_triage_environment.py` itself would also fail since it has `from server.graders import …` etc. that similarly cannot resolve in a flat layout.

A `from email_triage_environment import EmailTriageEnvironment` (and matching flat imports inside `email_triage_environment.py`) needs to be added as a final fallback for the Space to be deployable.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +36 to +37
from server.email_triage_environment import EmailTriageEnvironment
from models import EmailTriageAction
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 ALIGNMENT FLAG: Training script imports directly from server/

from server.email_triage_environment import EmailTriageEnvironment bypasses the client-server boundary defined in INVARIANTS.md ("Clients must never import from server/ directory") and violates the PRINCIPLES.md principle of minimising lifecycle deltas ("Training → Evals → Production should use identical interfaces"). The training script interacts with the environment via raw Python object calls rather than through EnvClient, which means the training path diverges from the production path.

  • Principle at stake: Minimize lifecycle deltas (PRINCIPLES.md) + Client-server separation (INVARIANTS.md)
  • The concern: Direct server/ import in the training path creates a training/production divergence that the two-interface model was designed to prevent.
  • Suggested reviewer: @darktex

Context Used: .claude/docs/INVARIANTS.md (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 36-37

Comment:
**ALIGNMENT FLAG: Training script imports directly from `server/`**

`from server.email_triage_environment import EmailTriageEnvironment` bypasses the client-server boundary defined in `INVARIANTS.md` ("Clients must never import from `server/` directory") and violates the `PRINCIPLES.md` principle of minimising lifecycle deltas ("Training → Evals → Production should use identical interfaces"). The training script interacts with the environment via raw Python object calls rather than through `EnvClient`, which means the training path diverges from the production path.

- **Principle at stake**: Minimize lifecycle deltas (PRINCIPLES.md) + Client-server separation (INVARIANTS.md)
- **The concern**: Direct `server/` import in the training path creates a training/production divergence that the two-interface model was designed to prevent.
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +30 to +40
def do_reset(difficulty):
seed = random.randint(0, 9999)
env = EmailTriageEnvironment(difficulty=difficulty)
obs = env.reset(seed=seed, difficulty=difficulty)
info = obs.info or {}

ticket_md = _fmt_ticket(obs)
spec_md = _fmt_specialists(info)
stats_md = _fmt_stats(info)
status = f"✅ Queue started! {info.get('queue_size', '?')} tickets in {difficulty.upper()} mode. Seed: {seed}"
return env, obs, info, ticket_md, spec_md, stats_md, status, 0.0, ""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 ALIGNMENT FLAG: Gradio UI exposes reset() and step() directly

do_reset and do_step create an EmailTriageEnvironment and call reset()/step() on it inside the UI handler. INVARIANTS.md states "The WebSocket interface for reset/step is for orchestration only" and "Agents cannot access reset/simulation controls." While this is a human-facing demo (not an agent), the UI bypasses the intended two-interface boundary entirely — there is no client or WebSocket layer between the UI and the environment object.

  • Principle at stake: Agent isolation / Dual API boundary (INVARIANTS.md)
  • The concern: Any code running in the same process as the UI (or any future LLM-backed demo user) can directly call env.reset() without going through the orchestration boundary.
  • Suggested reviewer: @darktex

Context Used: .claude/docs/INVARIANTS.md (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/email_triage_env/server/ui.py
Line: 30-40

Comment:
**ALIGNMENT FLAG: Gradio UI exposes `reset()` and `step()` directly**

`do_reset` and `do_step` create an `EmailTriageEnvironment` and call `reset()`/`step()` on it inside the UI handler. `INVARIANTS.md` states "The WebSocket interface for reset/step is for orchestration only" and "Agents cannot access reset/simulation controls." While this is a human-facing demo (not an agent), the UI bypasses the intended two-interface boundary entirely — there is no client or WebSocket layer between the UI and the environment object.

- **Principle at stake**: Agent isolation / Dual API boundary (INVARIANTS.md)
- **The concern**: Any code running in the same process as the UI (or any future LLM-backed demo user) can directly call `env.reset()` without going through the orchestration boundary.
- **Suggested reviewer**: `@darktex`

**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))

How can I resolve this? If you propose a fix, please make it concise.

Comment thread space/env_types.py
Comment on lines +1 to +4
# Renamed from types.py to avoid shadowing Python stdlib 'types' module
# Copyright (c) Meta Platforms, Inc. and affiliates.

from enum import Enum
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicated core type definitions

space/env_types.py re-implements Action, Observation, and State locally. These are the canonical base classes defined in src/openenv/core/env_server/types.py. Maintaining a separate copy risks diverging from the upstream definition. models.py's import-fallback chain also never tries from env_types import …, so this file isn't wired into any import path — it appears to be a dead file in the current repo layout.

Prompt To Fix With AI
This is a comment left during a code review.
Path: space/env_types.py
Line: 1-4

Comment:
**Duplicated core type definitions**

`space/env_types.py` re-implements `Action`, `Observation`, and `State` locally. These are the canonical base classes defined in `src/openenv/core/env_server/types.py`. Maintaining a separate copy risks diverging from the upstream definition. `models.py`'s import-fallback chain also never tries `from env_types import …`, so this file isn't wired into any import path — it appears to be a dead file in the current repo layout.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds the “Oversight Inbox Arena” email-triage environment (multi-turn, multi-agent, schema drift), plus training/evaluation tooling and Hugging Face Space deployment scaffolding.

Changes:

  • Introduces a multi-ticket EmailTriageEnvironment with deterministic graders, specialist simulations, and schema drift.
  • Adds GRPO training, evaluation scripts, notebooks, and smoke/E2E tests (HTTP + env).
  • Adds Hugging Face Space templates and a Gradio UI for interactive demos.

Reviewed changes

Copilot reviewed 36 out of 38 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/envs/test_email_triage_http.py Adds an end-to-end HTTP server smoke test.
tests/envs/test_email_triage_env.py Adds environment-level smoke tests across difficulty tiers + determinism.
space/requirements.txt Space dependency list.
space/env_types.py Pydantic types used by the Space bundle (renamed to avoid stdlib shadowing).
space/app.py Space entrypoint to launch the Gradio UI.
space/README.md Space metadata card and short description.
space/DEPLOY.md Manual deployment/copy instructions for Hugging Face Spaces.
pre_training.json Stores baseline/benchmark metrics snapshot.
envs/email_triage_env/train_grpo.py Provides a GRPO training script with multiple reward functions.
envs/email_triage_env/server/ui.py Implements the Gradio UI for the arena.
envs/email_triage_env/server/stakeholders.py Implements simulated specialist agents.
envs/email_triage_env/server/schema_drift.py Implements mid-episode policy/schema drift engine.
envs/email_triage_env/server/scenario_generator.py Deterministically generates multi-ticket scenarios by seed.
envs/email_triage_env/server/graders.py Implements deterministic reward graders, including multi-turn components.
envs/email_triage_env/server/email_triage_environment.py Core environment logic (multi-turn queue, drift, specialists, anti-hack).
envs/email_triage_env/server/email_triage_dataset.json Adds labeled email dataset used by the environment.
envs/email_triage_env/server/app.py FastAPI app wiring via create_app.
envs/email_triage_env/server/Dockerfile Container build/run entry for the environment server.
envs/email_triage_env/pyproject.toml Packages the environment for install/use.
envs/email_triage_env/openenv.yaml Declares the OpenEnv manifest for the environment.
envs/email_triage_env/models.py Defines Action/Observation/State models with fallbacks.
envs/email_triage_env/inference.py Adds an inference runner that spins up the env container and queries an LLM.
envs/email_triage_env/hf_space_template/requirements.txt Template Space requirements for quick deployment.
envs/email_triage_env/hf_space_template/app.py Template Space app entrypoint.
envs/email_triage_env/hf_space_template/README.md Template Space instructions.
envs/email_triage_env/eval_benchmark.py Adds a deterministic evaluation harness for baseline agents.
envs/email_triage_env/colab_t4_training.ipynb Colab training notebook for T4.
envs/email_triage_env/client.py Adds an OpenEnv client wrapper for the environment.
envs/email_triage_env/init.py Exposes public API for the env package.
envs/email_triage_env/Rhushya_OpenEnv_EmailTriage_Training.ipynb Additional training notebook/materials.
envs/email_triage_env/README_NEXT_STEPS.md Demo/training/deployment runbook.
envs/email_triage_env/README.md Full environment documentation and rationale.
envs/email_triage_env/FINAL_SHOWCASE_README.md Final presentation checklist/hand-off.
envs/email_triage_env/1.ipynb Additional Colab notebook copy.
envs/email_triage_env/.env.example Example environment variables for inference.
EmailTriage_GRPO_Train.ipynb Root-level GRPO notebook.
.gitignore Ignores local artifacts generated by tooling and evaluation outputs.
Comments suppressed due to low confidence (4)

tests/envs/test_email_triage_http.py:1

  • This test is prone to flakiness: it binds a fixed port (8099) which can conflict in parallel test runs/CI, and it uses a fixed sleep(3) instead of waiting for readiness. Also, server shutdown won’t run if an earlier assertion fails (no try/finally). Recommended: allocate an ephemeral free port, poll /health until ready (with a deadline), and ensure server.should_exit + join() happens in a finally block so the thread is always stopped.
    tests/envs/test_email_triage_http.py:1
  • This test is prone to flakiness: it binds a fixed port (8099) which can conflict in parallel test runs/CI, and it uses a fixed sleep(3) instead of waiting for readiness. Also, server shutdown won’t run if an earlier assertion fails (no try/finally). Recommended: allocate an ephemeral free port, poll /health until ready (with a deadline), and ensure server.should_exit + join() happens in a finally block so the thread is always stopped.
    tests/envs/test_email_triage_http.py:1
  • The comment says the first hard step “should NOT be done”, but the test never asserts that invariant (it only prints done_val). Add an assertion that data.get("done") is False here so the test actually validates the multi-turn behavior it describes.
    space/DEPLOY.md:1
  • The deployment instructions reference copying types.py, but this PR introduces space/env_types.py (explicitly renamed to avoid shadowing stdlib types). As written, following these steps will likely copy the wrong filename and break imports. Update the instructions to match the actual file name/layout used by the Space bundle.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

)

# ── 7. Determine done ────────────────────────────────────────
done = self._queue_index >= len(self._queue)
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).

Suggested change
done = self._queue_index >= len(self._queue)
done = self._queue_index >= len(self._queue)
queue_position = min(self._queue_index + 1, len(self._queue)) if self._queue else 0

Copilot uses AI. Check for mistakes.
Comment on lines +528 to +529
"queue_size": len(self._queue),
"queue_position": self._queue_index + 1,
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).

Copilot uses AI. Check for mistakes.
Comment on lines +579 to +580
"queue_position": self._queue_index + 1,
"queue_size": len(self._queue),
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).

Copilot uses AI. Check for mistakes.
Comment on lines +446 to +451
deadline = (
self._sla_deadlines[self._queue_index]
if self._queue_index < len(self._sla_deadlines)
else self._state.step_count
)
if self._state.step_count > deadline:
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As implemented, each step() always resolves exactly one ticket and advances the queue, so step_count grows as i+1 while the SLA deadlines are (i+1) * sla_steps_per_ticket. That means step_count > deadline will never be true for any sla_steps_per_ticket >= 1, so SLA breaches never occur and the SLA reward signal becomes effectively constant. To make SLA meaningful, either (a) allow multiple steps per ticket (don’t advance the queue on incorrect/unsafe actions), or (b) redefine deadlines relative to global episode budget or per-ticket elapsed steps in a way that can actually be exceeded.

Suggested change
deadline = (
self._sla_deadlines[self._queue_index]
if self._queue_index < len(self._sla_deadlines)
else self._state.step_count
)
if self._state.step_count > deadline:
# `_sla_deadlines` may be stored as cumulative queue deadlines
# (for example, `(i + 1) * sla_steps_per_ticket`). Because each
# step resolves exactly one ticket, comparing the current global
# `step_count` against that cumulative deadline can make breaches
# impossible for any SLA budget >= 1. Derive the current ticket's
# effective per-ticket SLA budget from the cumulative sequence so
# that the elapsed episode steps can meaningfully exceed it.
if self._queue_index < len(self._sla_deadlines):
cumulative_deadline = self._sla_deadlines[self._queue_index]
previous_cumulative_deadline = (
self._sla_deadlines[self._queue_index - 1]
if self._queue_index > 0
else 0
)
ticket_sla_budget = cumulative_deadline - previous_cumulative_deadline
else:
ticket_sla_budget = self._state.step_count
if self._state.step_count > ticket_sla_budget:

Copilot uses AI. Check for mistakes.
Comment on lines +481 to +483
# ── 6. Advance queue ─────────────────────────────────────────
self._queue_index += 1
self._state.tickets_resolved += 1
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As implemented, each step() always resolves exactly one ticket and advances the queue, so step_count grows as i+1 while the SLA deadlines are (i+1) * sla_steps_per_ticket. That means step_count > deadline will never be true for any sla_steps_per_ticket >= 1, so SLA breaches never occur and the SLA reward signal becomes effectively constant. To make SLA meaningful, either (a) allow multiple steps per ticket (don’t advance the queue on incorrect/unsafe actions), or (b) redefine deadlines relative to global episode budget or per-ticket elapsed steps in a way that can actually be exceeded.

Copilot uses AI. Check for mistakes.
Comment on lines +136 to +140
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This .ipynb file contains two top-level JSON notebook objects concatenated back-to-back (a second { ... } begins at line 139). That makes the notebook invalid JSON and it won’t open in Jupyter/Colab. Remove the duplicate notebook content (or split into separate files) so the file contains exactly one valid notebook JSON object.

Copilot uses AI. Check for mistakes.
| Pydantic serialization | All wire types are Pydantic models |
| Rewards inside environment boundary | All graders compute inside `step()` |
| Client-server separation | `client.py` never imports from `server/` |
| `SUPPORTS_CONCURRENT_SESSIONS = True` | Stateless across sessions |
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README states SUPPORTS_CONCURRENT_SESSIONS = True, but EmailTriageEnvironment sets SUPPORTS_CONCURRENT_SESSIONS = False in code. Please update the README to reflect the actual constant (or update the constant if the README is correct), since this affects how users deploy/run the server.

Suggested change
| `SUPPORTS_CONCURRENT_SESSIONS = True` | Stateless across sessions |
| `SUPPORTS_CONCURRENT_SESSIONS = False` | Concurrent sessions are not supported by the current environment implementation |

Copilot uses AI. Check for mistakes.
Comment on lines +181 to +188
# TRL ≥ v1.0 pulls in mergekit at import time; auto-install if missing.
import subprocess as _sp
for _pkg in ("mergekit", "fastmcp"):
try:
__import__(_pkg)
except ImportError:
print(f"[DEP] Installing missing dependency: {_pkg}")
_sp.check_call([sys.executable, "-m", "pip", "install", "-q", _pkg])
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Installing packages at runtime from within the training script introduces reliability and security issues (non-reproducible environments, network failures mid-run, unexpected dependency resolution). Prefer moving these dependencies into an explicit requirements/lockfile or documenting a pip install ... command for Colab; if you must keep this behavior, gate it behind an explicit flag (e.g., --auto-install-deps) so “normal” runs don’t mutate the environment.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants