Hf space#612
Conversation
- 4 specialist agents (triage/escalation/compliance/responder) - Schema drift engine with mid-episode policy mutations - 5 independent reward functions for TRL GRPO - Anti-reward-hacking: validation, timeout, repetition detection, reward capping - Curriculum learning support (easy -> medium -> hard) - 11 deterministic graders, all verifiable - 4 difficulty tiers (easy/medium/hard/adversarial) - Backward compatible: easy mode identical to Round 1 - Eval benchmark with 3 baseline agents - Comprehensive test suite (8 tests passing) - Pitch script and Q&A cheat sheet
…namically via XML
…ler lora, smaller seq length)
Remove generated/tooling files, fix UI and training script correctness issues, move tests into the standard suite, and add a practical next-steps guide for Hugging Face demo deployment and Colab T4 GRPO training. Made-with: Cursor
Correct UI category/options handling and optional Gradio imports, harden GRPO cache concurrency and shebang execution, align server startup behavior, and ensure env tests run under the repository-standard tests path. Made-with: Cursor
Add a complete end-to-end presentation playbook, polished cyber-style demo UI copy, and a ready-to-upload Space template so training on Colab T4 and final deployment can be executed quickly for judging. Made-with: Cursor
Fix tokenizer fallback and optimizer fallback in training, update docs with Colab-safe command usage, and provide a final ready-to-run T4 notebook that avoids shell/Python cell mixing errors. Made-with: Cursor
Bypass package init side-effects in train_grpo imports, add fastmcp guidance for Colab installs, and publish a final T4 notebook for the Rhushya/OpenEnv repo with separated shell/Python execution cells. Made-with: Cursor
…gths for Qwen2.5-1.5B
…ts, all cells tested
…d correct app.py and requirements
Greptile SummaryThis PR adds a complete Two P1 bugs need fixing before merge:
Two Tier-2 alignment points need
Confidence Score: 2/5Not safe to merge: two P1 bugs (corrupted reward cache, broken HF Space deployment) plus two unresolved alignment flags require human sign-off. Two P1 findings — a silent reward-corruption risk in the training cache and a deployment-breaking import chain in the HF Space — cap the score at 4, and the unreviewed alignment flags (direct server/ import, UI bypassing orchestration boundary) pull it further down to 2. envs/email_triage_env/train_grpo.py (cache key + reward scale + server/ import), space/app.py + DEPLOY.md (broken flat import chain), envs/email_triage_env/server/ui.py (direct env instantiation) Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[GRPO Training\ntrain_grpo.py] -->|direct import ⚠️| B[EmailTriageEnvironment\nserver/]
A -->|calls reset + step| B
B --> C[graders.py\nDeterministic rewards]
B --> D[schema_drift.py\nPolicy mutations]
B --> E[stakeholders.py\nSpecialist simulation]
F[EnvClient\nclient.py] -->|HTTP /reset /step| G[server/app.py\nFastAPI]
G --> B
H[space/app.py\nHF Space ⚠️] -->|from ui import| I[ui.py\nGradio UI]
I -->|direct instantiation ⚠️| B
style A fill:#f9a,stroke:#c00
style H fill:#f9a,stroke:#c00
style I fill:#ffd,stroke:#aa0
style F fill:#afa,stroke:#0a0
Prompt To Fix All With AIThis is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 63-64
Comment:
**Non-deterministic `hash()` as cache key**
`hash()` is randomized per Python process via `PYTHONHASHSEED` (since Python 3.3). In multi-worker GRPO training each worker computes a different hash for the same `(prompt, completion)` input, so the per-process cache never shares hits. More critically, within a single process two distinct pairs can collide to the same hash value, causing `_score` to silently return a stale cached reward and corrupt the training signal. Replace with a stable, collision-resistant string key — for example the raw string `prompt_text[-100:] + "|" + completion_text[:200]` (memory trade-off) or a `hashlib`-based hex digest.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 84-85
Comment:
**Reward scale mismatch across reward functions**
`reward_format` returns `±1.0` while all other reward functions (`reward_quality`, `reward_sla`, `reward_policy`, `reward_oversight`) return values in `[0.0, 1.0]`. GRPO sums all five signals during advantage estimation, so `reward_format` effectively has twice the influence of any other signal — and its negativity (`-1.0`) far outweighs the maximum of any other reward. Consider normalising to the same `[0.0, 1.0]` range:
```python
hacking_penalty = 1.0 if format_ok else 0.0
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: space/app.py
Line: 6-7
Comment:
**HF Space deployment will fail at import time**
`space/app.py` imports `from ui import build_ui`, and `ui.py` (copied from `server/`) tries to import `EmailTriageEnvironment` via three fallback chains, all requiring either the `envs.email_triage_env.server.*` package, `email_triage_env.server.*`, or `server.email_triage_environment` module. In the flat HF Space (per `DEPLOY.md`), files are copied to the root, so none of these import paths resolve — there is no `server/` sub-package and no `from email_triage_environment import …` fallback. `email_triage_environment.py` itself would also fail since it has `from server.graders import …` etc. that similarly cannot resolve in a flat layout.
A `from email_triage_environment import EmailTriageEnvironment` (and matching flat imports inside `email_triage_environment.py`) needs to be added as a final fallback for the Space to be deployable.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 36-37
Comment:
**ALIGNMENT FLAG: Training script imports directly from `server/`**
`from server.email_triage_environment import EmailTriageEnvironment` bypasses the client-server boundary defined in `INVARIANTS.md` ("Clients must never import from `server/` directory") and violates the `PRINCIPLES.md` principle of minimising lifecycle deltas ("Training → Evals → Production should use identical interfaces"). The training script interacts with the environment via raw Python object calls rather than through `EnvClient`, which means the training path diverges from the production path.
- **Principle at stake**: Minimize lifecycle deltas (PRINCIPLES.md) + Client-server separation (INVARIANTS.md)
- **The concern**: Direct `server/` import in the training path creates a training/production divergence that the two-interface model was designed to prevent.
- **Suggested reviewer**: `@darktex`
**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: envs/email_triage_env/server/ui.py
Line: 30-40
Comment:
**ALIGNMENT FLAG: Gradio UI exposes `reset()` and `step()` directly**
`do_reset` and `do_step` create an `EmailTriageEnvironment` and call `reset()`/`step()` on it inside the UI handler. `INVARIANTS.md` states "The WebSocket interface for reset/step is for orchestration only" and "Agents cannot access reset/simulation controls." While this is a human-facing demo (not an agent), the UI bypasses the intended two-interface boundary entirely — there is no client or WebSocket layer between the UI and the environment object.
- **Principle at stake**: Agent isolation / Dual API boundary (INVARIANTS.md)
- **The concern**: Any code running in the same process as the UI (or any future LLM-backed demo user) can directly call `env.reset()` without going through the orchestration boundary.
- **Suggested reviewer**: `@darktex`
**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: space/env_types.py
Line: 1-4
Comment:
**Duplicated core type definitions**
`space/env_types.py` re-implements `Action`, `Observation`, and `State` locally. These are the canonical base classes defined in `src/openenv/core/env_server/types.py`. Maintaining a separate copy risks diverging from the upstream definition. `models.py`'s import-fallback chain also never tries `from env_types import …`, so this file isn't wired into any import path — it appears to be a dead file in the current repo layout.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "Merge branch 'main' into hf-space" | Re-trigger Greptile |
| cache_key = hash(prompt_text[-100:] + completion_text[:200]) | ||
| with _CACHE_LOCK: |
There was a problem hiding this comment.
Non-deterministic
hash() as cache key
hash() is randomized per Python process via PYTHONHASHSEED (since Python 3.3). In multi-worker GRPO training each worker computes a different hash for the same (prompt, completion) input, so the per-process cache never shares hits. More critically, within a single process two distinct pairs can collide to the same hash value, causing _score to silently return a stale cached reward and corrupt the training signal. Replace with a stable, collision-resistant string key — for example the raw string prompt_text[-100:] + "|" + completion_text[:200] (memory trade-off) or a hashlib-based hex digest.
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 63-64
Comment:
**Non-deterministic `hash()` as cache key**
`hash()` is randomized per Python process via `PYTHONHASHSEED` (since Python 3.3). In multi-worker GRPO training each worker computes a different hash for the same `(prompt, completion)` input, so the per-process cache never shares hits. More critically, within a single process two distinct pairs can collide to the same hash value, causing `_score` to silently return a stale cached reward and corrupt the training signal. Replace with a stable, collision-resistant string key — for example the raw string `prompt_text[-100:] + "|" + completion_text[:200]` (memory trade-off) or a `hashlib`-based hex digest.
How can I resolve this? If you propose a fix, please make it concise.| format_ok = cat_m is not None and pri_m is not None and esc_m is not None | ||
| hacking_penalty = 1.0 if format_ok else -1.0 |
There was a problem hiding this comment.
Reward scale mismatch across reward functions
reward_format returns ±1.0 while all other reward functions (reward_quality, reward_sla, reward_policy, reward_oversight) return values in [0.0, 1.0]. GRPO sums all five signals during advantage estimation, so reward_format effectively has twice the influence of any other signal — and its negativity (-1.0) far outweighs the maximum of any other reward. Consider normalising to the same [0.0, 1.0] range:
hacking_penalty = 1.0 if format_ok else 0.0Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 84-85
Comment:
**Reward scale mismatch across reward functions**
`reward_format` returns `±1.0` while all other reward functions (`reward_quality`, `reward_sla`, `reward_policy`, `reward_oversight`) return values in `[0.0, 1.0]`. GRPO sums all five signals during advantage estimation, so `reward_format` effectively has twice the influence of any other signal — and its negativity (`-1.0`) far outweighs the maximum of any other reward. Consider normalising to the same `[0.0, 1.0]` range:
```python
hacking_penalty = 1.0 if format_ok else 0.0
```
How can I resolve this? If you propose a fix, please make it concise.| from ui import build_ui | ||
|
|
There was a problem hiding this comment.
HF Space deployment will fail at import time
space/app.py imports from ui import build_ui, and ui.py (copied from server/) tries to import EmailTriageEnvironment via three fallback chains, all requiring either the envs.email_triage_env.server.* package, email_triage_env.server.*, or server.email_triage_environment module. In the flat HF Space (per DEPLOY.md), files are copied to the root, so none of these import paths resolve — there is no server/ sub-package and no from email_triage_environment import … fallback. email_triage_environment.py itself would also fail since it has from server.graders import … etc. that similarly cannot resolve in a flat layout.
A from email_triage_environment import EmailTriageEnvironment (and matching flat imports inside email_triage_environment.py) needs to be added as a final fallback for the Space to be deployable.
Prompt To Fix With AI
This is a comment left during a code review.
Path: space/app.py
Line: 6-7
Comment:
**HF Space deployment will fail at import time**
`space/app.py` imports `from ui import build_ui`, and `ui.py` (copied from `server/`) tries to import `EmailTriageEnvironment` via three fallback chains, all requiring either the `envs.email_triage_env.server.*` package, `email_triage_env.server.*`, or `server.email_triage_environment` module. In the flat HF Space (per `DEPLOY.md`), files are copied to the root, so none of these import paths resolve — there is no `server/` sub-package and no `from email_triage_environment import …` fallback. `email_triage_environment.py` itself would also fail since it has `from server.graders import …` etc. that similarly cannot resolve in a flat layout.
A `from email_triage_environment import EmailTriageEnvironment` (and matching flat imports inside `email_triage_environment.py`) needs to be added as a final fallback for the Space to be deployable.
How can I resolve this? If you propose a fix, please make it concise.| from server.email_triage_environment import EmailTriageEnvironment | ||
| from models import EmailTriageAction |
There was a problem hiding this comment.
ALIGNMENT FLAG: Training script imports directly from
server/
from server.email_triage_environment import EmailTriageEnvironment bypasses the client-server boundary defined in INVARIANTS.md ("Clients must never import from server/ directory") and violates the PRINCIPLES.md principle of minimising lifecycle deltas ("Training → Evals → Production should use identical interfaces"). The training script interacts with the environment via raw Python object calls rather than through EnvClient, which means the training path diverges from the production path.
- Principle at stake: Minimize lifecycle deltas (PRINCIPLES.md) + Client-server separation (INVARIANTS.md)
- The concern: Direct
server/import in the training path creates a training/production divergence that the two-interface model was designed to prevent. - Suggested reviewer:
@darktex
Context Used: .claude/docs/INVARIANTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/email_triage_env/train_grpo.py
Line: 36-37
Comment:
**ALIGNMENT FLAG: Training script imports directly from `server/`**
`from server.email_triage_environment import EmailTriageEnvironment` bypasses the client-server boundary defined in `INVARIANTS.md` ("Clients must never import from `server/` directory") and violates the `PRINCIPLES.md` principle of minimising lifecycle deltas ("Training → Evals → Production should use identical interfaces"). The training script interacts with the environment via raw Python object calls rather than through `EnvClient`, which means the training path diverges from the production path.
- **Principle at stake**: Minimize lifecycle deltas (PRINCIPLES.md) + Client-server separation (INVARIANTS.md)
- **The concern**: Direct `server/` import in the training path creates a training/production divergence that the two-interface model was designed to prevent.
- **Suggested reviewer**: `@darktex`
**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))
How can I resolve this? If you propose a fix, please make it concise.| def do_reset(difficulty): | ||
| seed = random.randint(0, 9999) | ||
| env = EmailTriageEnvironment(difficulty=difficulty) | ||
| obs = env.reset(seed=seed, difficulty=difficulty) | ||
| info = obs.info or {} | ||
|
|
||
| ticket_md = _fmt_ticket(obs) | ||
| spec_md = _fmt_specialists(info) | ||
| stats_md = _fmt_stats(info) | ||
| status = f"✅ Queue started! {info.get('queue_size', '?')} tickets in {difficulty.upper()} mode. Seed: {seed}" | ||
| return env, obs, info, ticket_md, spec_md, stats_md, status, 0.0, "" |
There was a problem hiding this comment.
ALIGNMENT FLAG: Gradio UI exposes
reset() and step() directly
do_reset and do_step create an EmailTriageEnvironment and call reset()/step() on it inside the UI handler. INVARIANTS.md states "The WebSocket interface for reset/step is for orchestration only" and "Agents cannot access reset/simulation controls." While this is a human-facing demo (not an agent), the UI bypasses the intended two-interface boundary entirely — there is no client or WebSocket layer between the UI and the environment object.
- Principle at stake: Agent isolation / Dual API boundary (INVARIANTS.md)
- The concern: Any code running in the same process as the UI (or any future LLM-backed demo user) can directly call
env.reset()without going through the orchestration boundary. - Suggested reviewer:
@darktex
Context Used: .claude/docs/INVARIANTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: envs/email_triage_env/server/ui.py
Line: 30-40
Comment:
**ALIGNMENT FLAG: Gradio UI exposes `reset()` and `step()` directly**
`do_reset` and `do_step` create an `EmailTriageEnvironment` and call `reset()`/`step()` on it inside the UI handler. `INVARIANTS.md` states "The WebSocket interface for reset/step is for orchestration only" and "Agents cannot access reset/simulation controls." While this is a human-facing demo (not an agent), the UI bypasses the intended two-interface boundary entirely — there is no client or WebSocket layer between the UI and the environment object.
- **Principle at stake**: Agent isolation / Dual API boundary (INVARIANTS.md)
- **The concern**: Any code running in the same process as the UI (or any future LLM-backed demo user) can directly call `env.reset()` without going through the orchestration boundary.
- **Suggested reviewer**: `@darktex`
**Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))
How can I resolve this? If you propose a fix, please make it concise.| # Renamed from types.py to avoid shadowing Python stdlib 'types' module | ||
| # Copyright (c) Meta Platforms, Inc. and affiliates. | ||
|
|
||
| from enum import Enum |
There was a problem hiding this comment.
Duplicated core type definitions
space/env_types.py re-implements Action, Observation, and State locally. These are the canonical base classes defined in src/openenv/core/env_server/types.py. Maintaining a separate copy risks diverging from the upstream definition. models.py's import-fallback chain also never tries from env_types import …, so this file isn't wired into any import path — it appears to be a dead file in the current repo layout.
Prompt To Fix With AI
This is a comment left during a code review.
Path: space/env_types.py
Line: 1-4
Comment:
**Duplicated core type definitions**
`space/env_types.py` re-implements `Action`, `Observation`, and `State` locally. These are the canonical base classes defined in `src/openenv/core/env_server/types.py`. Maintaining a separate copy risks diverging from the upstream definition. `models.py`'s import-fallback chain also never tries `from env_types import …`, so this file isn't wired into any import path — it appears to be a dead file in the current repo layout.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds the “Oversight Inbox Arena” email-triage environment (multi-turn, multi-agent, schema drift), plus training/evaluation tooling and Hugging Face Space deployment scaffolding.
Changes:
- Introduces a multi-ticket
EmailTriageEnvironmentwith deterministic graders, specialist simulations, and schema drift. - Adds GRPO training, evaluation scripts, notebooks, and smoke/E2E tests (HTTP + env).
- Adds Hugging Face Space templates and a Gradio UI for interactive demos.
Reviewed changes
Copilot reviewed 36 out of 38 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/envs/test_email_triage_http.py | Adds an end-to-end HTTP server smoke test. |
| tests/envs/test_email_triage_env.py | Adds environment-level smoke tests across difficulty tiers + determinism. |
| space/requirements.txt | Space dependency list. |
| space/env_types.py | Pydantic types used by the Space bundle (renamed to avoid stdlib shadowing). |
| space/app.py | Space entrypoint to launch the Gradio UI. |
| space/README.md | Space metadata card and short description. |
| space/DEPLOY.md | Manual deployment/copy instructions for Hugging Face Spaces. |
| pre_training.json | Stores baseline/benchmark metrics snapshot. |
| envs/email_triage_env/train_grpo.py | Provides a GRPO training script with multiple reward functions. |
| envs/email_triage_env/server/ui.py | Implements the Gradio UI for the arena. |
| envs/email_triage_env/server/stakeholders.py | Implements simulated specialist agents. |
| envs/email_triage_env/server/schema_drift.py | Implements mid-episode policy/schema drift engine. |
| envs/email_triage_env/server/scenario_generator.py | Deterministically generates multi-ticket scenarios by seed. |
| envs/email_triage_env/server/graders.py | Implements deterministic reward graders, including multi-turn components. |
| envs/email_triage_env/server/email_triage_environment.py | Core environment logic (multi-turn queue, drift, specialists, anti-hack). |
| envs/email_triage_env/server/email_triage_dataset.json | Adds labeled email dataset used by the environment. |
| envs/email_triage_env/server/app.py | FastAPI app wiring via create_app. |
| envs/email_triage_env/server/Dockerfile | Container build/run entry for the environment server. |
| envs/email_triage_env/pyproject.toml | Packages the environment for install/use. |
| envs/email_triage_env/openenv.yaml | Declares the OpenEnv manifest for the environment. |
| envs/email_triage_env/models.py | Defines Action/Observation/State models with fallbacks. |
| envs/email_triage_env/inference.py | Adds an inference runner that spins up the env container and queries an LLM. |
| envs/email_triage_env/hf_space_template/requirements.txt | Template Space requirements for quick deployment. |
| envs/email_triage_env/hf_space_template/app.py | Template Space app entrypoint. |
| envs/email_triage_env/hf_space_template/README.md | Template Space instructions. |
| envs/email_triage_env/eval_benchmark.py | Adds a deterministic evaluation harness for baseline agents. |
| envs/email_triage_env/colab_t4_training.ipynb | Colab training notebook for T4. |
| envs/email_triage_env/client.py | Adds an OpenEnv client wrapper for the environment. |
| envs/email_triage_env/init.py | Exposes public API for the env package. |
| envs/email_triage_env/Rhushya_OpenEnv_EmailTriage_Training.ipynb | Additional training notebook/materials. |
| envs/email_triage_env/README_NEXT_STEPS.md | Demo/training/deployment runbook. |
| envs/email_triage_env/README.md | Full environment documentation and rationale. |
| envs/email_triage_env/FINAL_SHOWCASE_README.md | Final presentation checklist/hand-off. |
| envs/email_triage_env/1.ipynb | Additional Colab notebook copy. |
| envs/email_triage_env/.env.example | Example environment variables for inference. |
| EmailTriage_GRPO_Train.ipynb | Root-level GRPO notebook. |
| .gitignore | Ignores local artifacts generated by tooling and evaluation outputs. |
Comments suppressed due to low confidence (4)
tests/envs/test_email_triage_http.py:1
- This test is prone to flakiness: it binds a fixed port (
8099) which can conflict in parallel test runs/CI, and it uses a fixedsleep(3)instead of waiting for readiness. Also, server shutdown won’t run if an earlier assertion fails (notry/finally). Recommended: allocate an ephemeral free port, poll/healthuntil ready (with a deadline), and ensureserver.should_exit+join()happens in afinallyblock so the thread is always stopped.
tests/envs/test_email_triage_http.py:1 - This test is prone to flakiness: it binds a fixed port (
8099) which can conflict in parallel test runs/CI, and it uses a fixedsleep(3)instead of waiting for readiness. Also, server shutdown won’t run if an earlier assertion fails (notry/finally). Recommended: allocate an ephemeral free port, poll/healthuntil ready (with a deadline), and ensureserver.should_exit+join()happens in afinallyblock so the thread is always stopped.
tests/envs/test_email_triage_http.py:1 - The comment says the first hard step “should NOT be done”, but the test never asserts that invariant (it only prints
done_val). Add an assertion thatdata.get("done") is Falsehere so the test actually validates the multi-turn behavior it describes.
space/DEPLOY.md:1 - The deployment instructions reference copying
types.py, but this PR introducesspace/env_types.py(explicitly renamed to avoid shadowing stdlibtypes). As written, following these steps will likely copy the wrong filename and break imports. Update the instructions to match the actual file name/layout used by the Space bundle.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ) | ||
|
|
||
| # ── 7. Determine done ──────────────────────────────────────── | ||
| done = self._queue_index >= len(self._queue) |
There was a problem hiding this comment.
queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).
| done = self._queue_index >= len(self._queue) | |
| done = self._queue_index >= len(self._queue) | |
| queue_position = min(self._queue_index + 1, len(self._queue)) if self._queue else 0 |
| "queue_size": len(self._queue), | ||
| "queue_position": self._queue_index + 1, |
There was a problem hiding this comment.
queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).
| "queue_position": self._queue_index + 1, | ||
| "queue_size": len(self._queue), |
There was a problem hiding this comment.
queue_position becomes queue_size + 1 on the terminal step because _queue_index is incremented before building info/metadata. This produces incorrect state/UX and can break clients that assume 1 <= queue_position <= queue_size. Consider computing a queue_position value that clamps to len(self._queue) when done=True (and use that consistently for both info and metadata).
| deadline = ( | ||
| self._sla_deadlines[self._queue_index] | ||
| if self._queue_index < len(self._sla_deadlines) | ||
| else self._state.step_count | ||
| ) | ||
| if self._state.step_count > deadline: |
There was a problem hiding this comment.
As implemented, each step() always resolves exactly one ticket and advances the queue, so step_count grows as i+1 while the SLA deadlines are (i+1) * sla_steps_per_ticket. That means step_count > deadline will never be true for any sla_steps_per_ticket >= 1, so SLA breaches never occur and the SLA reward signal becomes effectively constant. To make SLA meaningful, either (a) allow multiple steps per ticket (don’t advance the queue on incorrect/unsafe actions), or (b) redefine deadlines relative to global episode budget or per-ticket elapsed steps in a way that can actually be exceeded.
| deadline = ( | |
| self._sla_deadlines[self._queue_index] | |
| if self._queue_index < len(self._sla_deadlines) | |
| else self._state.step_count | |
| ) | |
| if self._state.step_count > deadline: | |
| # `_sla_deadlines` may be stored as cumulative queue deadlines | |
| # (for example, `(i + 1) * sla_steps_per_ticket`). Because each | |
| # step resolves exactly one ticket, comparing the current global | |
| # `step_count` against that cumulative deadline can make breaches | |
| # impossible for any SLA budget >= 1. Derive the current ticket's | |
| # effective per-ticket SLA budget from the cumulative sequence so | |
| # that the elapsed episode steps can meaningfully exceed it. | |
| if self._queue_index < len(self._sla_deadlines): | |
| cumulative_deadline = self._sla_deadlines[self._queue_index] | |
| previous_cumulative_deadline = ( | |
| self._sla_deadlines[self._queue_index - 1] | |
| if self._queue_index > 0 | |
| else 0 | |
| ) | |
| ticket_sla_budget = cumulative_deadline - previous_cumulative_deadline | |
| else: | |
| ticket_sla_budget = self._state.step_count | |
| if self._state.step_count > ticket_sla_budget: |
| # ── 6. Advance queue ───────────────────────────────────────── | ||
| self._queue_index += 1 | ||
| self._state.tickets_resolved += 1 |
There was a problem hiding this comment.
As implemented, each step() always resolves exactly one ticket and advances the queue, so step_count grows as i+1 while the SLA deadlines are (i+1) * sla_steps_per_ticket. That means step_count > deadline will never be true for any sla_steps_per_ticket >= 1, so SLA breaches never occur and the SLA reward signal becomes effectively constant. To make SLA meaningful, either (a) allow multiple steps per ticket (don’t advance the queue on incorrect/unsafe actions), or (b) redefine deadlines relative to global episode budget or per-ticket elapsed steps in a way that can actually be exceeded.
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } | ||
| { | ||
| "cells": [ |
There was a problem hiding this comment.
This .ipynb file contains two top-level JSON notebook objects concatenated back-to-back (a second { ... } begins at line 139). That makes the notebook invalid JSON and it won’t open in Jupyter/Colab. Remove the duplicate notebook content (or split into separate files) so the file contains exactly one valid notebook JSON object.
| | Pydantic serialization | All wire types are Pydantic models | | ||
| | Rewards inside environment boundary | All graders compute inside `step()` | | ||
| | Client-server separation | `client.py` never imports from `server/` | | ||
| | `SUPPORTS_CONCURRENT_SESSIONS = True` | Stateless across sessions | |
There was a problem hiding this comment.
The README states SUPPORTS_CONCURRENT_SESSIONS = True, but EmailTriageEnvironment sets SUPPORTS_CONCURRENT_SESSIONS = False in code. Please update the README to reflect the actual constant (or update the constant if the README is correct), since this affects how users deploy/run the server.
| | `SUPPORTS_CONCURRENT_SESSIONS = True` | Stateless across sessions | | |
| | `SUPPORTS_CONCURRENT_SESSIONS = False` | Concurrent sessions are not supported by the current environment implementation | |
| # TRL ≥ v1.0 pulls in mergekit at import time; auto-install if missing. | ||
| import subprocess as _sp | ||
| for _pkg in ("mergekit", "fastmcp"): | ||
| try: | ||
| __import__(_pkg) | ||
| except ImportError: | ||
| print(f"[DEP] Installing missing dependency: {_pkg}") | ||
| _sp.check_call([sys.executable, "-m", "pip", "install", "-q", _pkg]) |
There was a problem hiding this comment.
Installing packages at runtime from within the training script introduces reliability and security issues (non-reproducible environments, network failures mid-run, unexpected dependency resolution). Prefer moving these dependencies into an explicit requirements/lockfile or documenting a pip install ... command for Colab; if you must keep this behavior, gate it behind an explicit flag (e.g., --auto-install-deps) so “normal” runs don’t mutate the environment.
Summary
Type of Change
Alignment Checklist
Before submitting, verify:
.claude/docs/PRINCIPLES.mdand this PR aligns with our principles.claude/docs/INVARIANTS.mdand no invariants are violated/pre-submit-pr(orbash .claude/hooks/lint.shand tests) and addressed all issuesRFC Status
Test Plan
Claude Code Review