Personal endpoint docs path + graceful remote timeout handling #58

Merged: rajeeja merged 1 commit into `main` from `rajeeja/personal-endpoint-and-timeout-handling` on Jun 9, 2026

@rajeeja rajeeja commented Jun 9, 2026

Collaborator

Summary

  • README router gets a new row for "HPC user, your own personal endpoint" so solo users aren't routed through the sysadmin path.
  • `docs/operating-an-endpoint.md` leads with a 30-minute personal quickstart (six commands, one config edit) and keeps the full 8-step shared setup below for the team/lab case.
  • `AGENTS.md` adds an upfront-expectation-setting rule for any `use_remote=True` invocation.
  • `remote/health.py`: drop the `with Executor(...) as ex:` context manager and use `try/finally + shutdown(wait=False)`. The old form drained pending futures on exit, causing probes against slow/wedged endpoints to hang for the full timeout (or longer).
  • `tools/remote_tools.py`: wrap the remote call so a `TimeoutError` returns an actionable message ("job likely waiting in Slurm/PBS queue, raise timeout_seconds or wait") instead of a bare traceback.
  • `uv.lock`: catches up to 0.1.1 (already in `pyproject.toml`).
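
The shutdown behavior described for `remote/health.py` can be illustrated with a minimal sketch. It is not the code from this PR: it assumes the standard-library `ThreadPoolExecutor` as a stand-in for the `Executor` used in the repository, and `slow_probe`, `probe_with_context_manager`, and `probe_with_explicit_shutdown` are hypothetical names. The standard-library executor's `__exit__` calls `shutdown(wait=True)`, which is the same kind of blocking-on-exit behavior the bullet describes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_probe(delay: float) -> str:
    """Hypothetical stand-in for a probe against a slow or wedged endpoint."""
    time.sleep(delay)
    return "ok"


def probe_with_context_manager(delay: float, timeout_seconds: float):
    """Old shape: the `with` block waits for pending work when it exits."""
    start = time.monotonic()
    status = "ok"
    with ThreadPoolExecutor(max_workers=1) as ex:
        fut = ex.submit(slow_probe, delay)
        try:
            fut.result(timeout=timeout_seconds)
        except FutureTimeout:
            status = "timeout"
    # __exit__ called shutdown(wait=True), so this line is reached only
    # after slow_probe has finished, regardless of timeout_seconds.
    return status, time.monotonic() - start


def probe_with_explicit_shutdown(delay: float, timeout_seconds: float):
    """New shape: try/finally with shutdown(wait=False) returns promptly."""
    start = time.monotonic()
    status = "ok"
    ex = ThreadPoolExecutor(max_workers=1)
    try:
        fut = ex.submit(slow_probe, delay)
        try:
            fut.result(timeout=timeout_seconds)
        except FutureTimeout:
            status = "timeout"
    finally:
        # Do not block on the still-running probe; just release the executor.
        ex.shutdown(wait=False)
    return status, time.monotonic() - start


if __name__ == "__main__":
    print(probe_with_context_manager(0.5, 0.05))
    print(probe_with_explicit_shutdown(0.5, 0.05))
```

Both functions report a timeout, but only the second returns close to `timeout_seconds`. With the standard-library pool the abandoned worker thread still runs to completion in the background; whether the repository's `Executor` behaves the same way is not confirmed here.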

Why

Two pain points hit in real use:

  1. The README router conflated "I'll stand up an endpoint just for me" with "I'm setting up a shared team endpoint with a service account and an allowlist." The latter is a 1+ hr undertaking; the former is 30 min. Routing them together scared off solo users.
  2. When a Globus Compute endpoint's scheduler is backlogged or its worker is wedged, the old probe path could hang far longer than `timeout_seconds`, and timed-out remote tool calls surfaced as cryptic tracebacks.

Test plan

  • `uv run pre-commit run --all-files` (clean)
  • `uv run pytest tests/ --ignore=tests/test_remote_agent.py` (293 passed)
  • `uv run --extra docs sphinx-build -b html -W --keep-going docs docs/_build/html` (clean, after promoting `#### MEP allowlist` to `###` so MyST emits an anchor for it)
  • Manual exercise of the remote timeout path against a live endpoint is left for a follow-up PR; the structural change is small, and the existing tests cover the success path.

…andling

Docs:
- README router now distinguishes "your own personal endpoint" (case 3) from
  "shared/group endpoint operator" (case 4). The first was previously folded
  into the sysadmin path and underserved.
- docs/operating-an-endpoint.md opens with a 30-minute solo personal quickstart
  (six commands plus one config edit) before the full 8-step shared setup,
  with clear pointers to when each piece of hardening becomes mandatory.
- AGENTS.md adds an "upfront remote expectation setting" rule so the agent
  surfaces target endpoint, queue characteristics, and timeout to the user
  before invoking any use_remote=True tool.

Code:
- remote/health.py: replace `with Executor(...) as ex:` with explicit
  try/finally + ex.shutdown(wait=False). The context manager drained
  pending futures on exit, which caused probes against a slow or wedged
  endpoint to hang for the full timeout (and sometimes much longer) instead
  of returning.
- tools/remote_tools.py: wrap the remote call in try/except so a TimeoutError
  becomes an actionable message ("job likely waiting in Slurm/PBS queue,
  raise timeout_seconds or wait") instead of a bare traceback.

Lock:
- uv.lock catches up to the 0.1.1 package version already in pyproject.toml.
@rajeeja rajeeja merged commit 4a09fa6 into main Jun 9, 2026
9 checks passed
@rajeeja rajeeja deleted the rajeeja/personal-endpoint-and-timeout-handling branch June 9, 2026 16:19