From 39dba28ef13a10279e11a51e0d0e750b1ca3ba20 Mon Sep 17 00:00:00 2001 From: dejaguarkyng Date: Wed, 3 Jun 2026 10:59:37 +0000 Subject: [PATCH 1/5] feat: add Jungle Grid GPU execution agent demo --- .../IMPLEMENTATION_DECISION.md | 57 ++ .../09_jungle_grid_gpu_execution/README.md | 166 +++++ .../agents/jungle_grid_executor.py | 571 ++++++++++++++++++ .../09_jungle_grid_gpu_execution/network.yaml | 89 +++ tests/agents/test_jungle_grid_executor.py | 482 +++++++++++++++ 5 files changed, 1365 insertions(+) create mode 100644 sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md create mode 100644 sdk/demos/09_jungle_grid_gpu_execution/README.md create mode 100644 sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py create mode 100644 sdk/demos/09_jungle_grid_gpu_execution/network.yaml create mode 100644 tests/agents/test_jungle_grid_executor.py diff --git a/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md b/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md new file mode 100644 index 000000000..41375c1d0 --- /dev/null +++ b/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md @@ -0,0 +1,57 @@ +# Jungle Grid Integration Decision + +## Selected Extension Point + +This contribution is a runnable demo network with a Python `WorkerAgent`. The agent +uses OpenAgents' project mod for the long-running workflow, project messages for +estimate and lifecycle updates, and project artifacts for logs and Jungle Grid +artifact metadata. + +Jungle Grid is an external execution layer, not an OpenAgents transport, launcher +agent type, or network mod. A demo keeps the integration provider-specific while +showing a reusable OpenAgents pattern: an agent delegates asynchronous compute, +waits for human approval before billable work, and returns results to a shared +project. 
+ +## Rejected Alternatives + +- **Launcher agent type:** Jungle Grid executes workloads; it is not an interactive + coding-agent runtime managed by the launcher. +- **Core provider integration:** No OpenAgents core abstraction requires a + provider-specific compute backend. +- **Jungle Grid mod:** The integration does not add network-wide event semantics or + shared infrastructure. Existing project events already cover the workflow. +- **Hosted MCP entry:** OpenAgents can load external MCP tools, but the current + Streamable HTTP MCP connector does not perform Jungle Grid's hosted OAuth flow or + attach API-key headers. Adding that capability solely for this demo would be a + core architecture change. +- **Local stdio MCP dependency:** The Jungle Grid stdio MCP package is supported, + but a direct Python API client is easier to validate, test, and constrain around + mandatory human approval. It also avoids requiring Node.js for a Python demo. + +## Jungle Grid Contract Used + +The demo uses the documented public execution API: + +- `POST /v1/jobs/estimate` +- `POST /v1/jobs` +- `GET /v1/jobs/{job_id}` +- `GET /v1/jobs/{job_id}/logs` +- `POST /v1/jobs/{job_id}/cancel` +- `GET /v1/jobs/{job_id}/artifacts` +- `POST /v1/jobs/{job_id}/artifacts/{artifact_id}/download` + +Authentication is a scoped server-side API key in `JUNGLE_GRID_API_KEY`. The +documented lifecycle includes `pending`, `queued`, `assigned`, `running`, +`completed`, `failed`, `rejected`, and `cancelled`. + +Workload environment values are not accepted in project goals. A goal may use +`environment_from_env` to reference variables available only in the executor +process; those values are resolved after human approval and are excluded from +the estimate request and project-visible output. + +## Contribution Workflow + +OpenAgents' contributing guide asks contributors to create an issue for feature +suggestions before submitting a pull request. 
This demo should be proposed in an +issue and held for maintainer direction before a PR is opened. diff --git a/sdk/demos/09_jungle_grid_gpu_execution/README.md b/sdk/demos/09_jungle_grid_gpu_execution/README.md new file mode 100644 index 000000000..a9841b8df --- /dev/null +++ b/sdk/demos/09_jungle_grid_gpu_execution/README.md @@ -0,0 +1,166 @@ +# Jungle Grid GPU Execution Demo + +This demo shows an OpenAgents execution agent delegating long-running AI and GPU +workloads to [Jungle Grid](https://junglegrid.dev), an execution layer that +places and runs AI workloads without requiring agents to manage GPU servers. + +The workflow fits OpenAgents because the workload is asynchronous and +collaborative: an agent estimates the job, a human approves spending in the +shared project, and the agent returns lifecycle updates, logs, and artifact +metadata to the same workspace. + +## Security And Billing Warning + +Jungle Grid jobs may consume credits or incur charges. The executor never submits +a workload when a project starts. It requires an exact approval command from a +human identity after posting the estimate. Keep API keys in environment variables +and do not paste secrets into project goals, messages, logs, metadata, or +committed files. Workloads that need environment values must use +`environment_from_env`; the executor resolves those references only after human +approval, immediately before submission. + +## Prerequisites + +- Python with the OpenAgents development package installed. +- A Jungle Grid account and a scoped API key that can estimate, submit, read, and + cancel jobs. +- A public container image suitable for the requested workload. + +## Environment Variables + +- `JUNGLE_GRID_API_KEY` is required. The agent reads this server-side API key and + sends it only as a Bearer token to Jungle Grid. +- `JUNGLEGRID_API_BASE` optionally overrides the default API base, + `https://api.junglegrid.dev`. 
+- Any workload-specific variables referenced by `environment_from_env` must also + be exported in the executor process. Their values are never placed in the + project goal or estimate request. + +## Setup + +From the repository root, install OpenAgents with SDK and development +dependencies so the network, agent, and test commands are available: + +```bash +pip install -e ".[sdk,dev]" +``` + +Export the Jungle Grid API key in the shell that will run the executor. This +keeps the credential out of the repository and network configuration: + +```bash +export JUNGLE_GRID_API_KEY="jg_..." +``` + +## Run The Demo + +Start the OpenAgents network from this demo directory. The network enables the +project mod and exposes the `Jungle Grid GPU Execution` project template: + +```bash +cd sdk/demos/09_jungle_grid_gpu_execution +openagents network start network.yaml +``` + +In a second terminal, start the deterministic Python executor. It does not need +an LLM provider key: + +```bash +cd sdk/demos/09_jungle_grid_gpu_execution +python agents/jungle_grid_executor.py +``` + +Open Studio at `http://localhost:8700/studio`, create a project with the +`Jungle Grid GPU Execution` template, and use a JSON object as the project goal. +For example: + +```json +{ + "name": "openagents-batch-demo", + "workload_type": "batch", + "image": "python:3.11-slim", + "command": "python", + "args": ["-c", "print('hello from Jungle Grid')"], + "optimize_for": "cost" +} +``` + +The agent validates the request and calls Jungle Grid's estimate endpoint. It +posts the structured estimate and stores it as project artifact +`jungle_grid_estimate`. No compute has been submitted at this point. + +For a workload that needs a credential or other environment value, export it in +the executor shell and reference only its local variable name in the goal: + +```bash +export MODEL_TOKEN="..." 
+``` + +```json +{ + "name": "openagents-inference-demo", + "workload_type": "inference", + "image": "example/model-server:latest", + "environment_from_env": { + "MODEL_TOKEN": "MODEL_TOKEN" + }, + "optimize_for": "cost" +} +``` + +The mapping key is the variable sent to the workload, and the mapping value is +the local executor variable to resolve. Literal `environment` values, API keys, +Bearer tokens, and secret-like metadata keys are rejected. + +Review the estimate, then reply in the project with the exact command shown by +the agent, which includes the estimate ID. Estimates that explicitly report +`available: false` or `can_submit: false` cannot be approved: + +```text +APPROVE <estimate_id> +``` + +After approval, the agent submits the workload, posts status changes such as +submitted, queued, assigned/provisioning, running, completed, failed, rejected, +or cancelled, and stores the final job details, logs, artifact list, and +temporary download metadata in project artifact `jungle_grid_result`. + +To cancel a submitted job, reply with the cancel command and the exact job ID: + +```text +CANCEL <job_id> +``` + +Cancellation is explicit and only applies when the job ID matches the project. +Only a human identity can request cancellation. The agent reports cancellation +failures without exposing the API key. + +## Failure Behavior + +Invalid workload JSON, missing required fields, missing API keys, timeouts, +invalid Jungle Grid responses, and API errors are posted to the project in +sanitized form. Failed, rejected, or cancelled jobs stop the OpenAgents project. +Completed jobs complete the project. + +## Tests + +Run the focused mocked tests. 
They do not contact Jungle Grid or submit paid +work: + +```bash +pytest tests/agents/test_jungle_grid_executor.py +``` + +Run the repository formatter and linter checks used by the Python project: + +```bash +ruff format --check sdk/demos/09_jungle_grid_gpu_execution tests/agents/test_jungle_grid_executor.py +ruff check sdk/demos/09_jungle_grid_gpu_execution tests/agents/test_jungle_grid_executor.py +``` + +## Optional Live Estimate + +The normal demo performs a live estimate when a project starts, but it never +automatically submits a job. Use a low-cost workload goal, review the estimate in +the project, and do not send the approval command unless you explicitly intend +to start billable compute. diff --git a/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py b/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py new file mode 100644 index 000000000..6e8a369c2 --- /dev/null +++ b/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py @@ -0,0 +1,571 @@ +#!/usr/bin/env python3 +"""Jungle Grid execution agent for the OpenAgents project workflow demo.""" + +import asyncio +import json +import logging +import os +import re +import uuid +from dataclasses import dataclass +from typing import Any, Dict, Iterable, Optional +from urllib.parse import quote + +import aiohttp + +from openagents.agents.worker_agent import WorkerAgent, on_event +from openagents.models.event_context import EventContext +from openagents.mods.workspace.project import DefaultProjectAgentAdapter + +logger = logging.getLogger(__name__) + +DEFAULT_API_BASE = "https://api.junglegrid.dev" +TERMINAL_STATUSES = {"completed", "failed", "rejected", "cancelled"} +VALID_WORKLOAD_TYPES = {"inference", "training", "fine-tuning", "batch"} +VALID_OPTIMIZE_FOR = {"balanced", "cost", "speed"} +SUBMIT_FIELDS = { + "name", + "workload_type", + "image", + "command", + "args", + "environment_from_env", + "optimize_for", + "template", + "metadata", +} +ESTIMATE_FIELDS 
= { + "workload_type", + "image", + "command", + "args", + "optimize_for", + "template", +} +SENSITIVE_PATTERN = re.compile(r"(?i)(bearer\s+)[^\s,;]+|jg_[A-Za-z0-9_-]+") +SENSITIVE_KEY_PATTERN = re.compile(r"(?i)(api[_-]?key|authorization|password|secret|token)") + + +class JungleGridError(Exception): + """Sanitized Jungle Grid client error.""" + + def __init__(self, code: str, message: str, status: Optional[int] = None): + super().__init__(message) + self.code = code + self.status = status + + +def redact_sensitive(value: Any, secret: Optional[str] = None) -> str: + """Return a log-safe string with credentials removed.""" + text = str(value) + if secret: + text = text.replace(secret, "[REDACTED]") + return SENSITIVE_PATTERN.sub(lambda match: f"{match.group(1) or ''}[REDACTED]", text) + + +def _collect_string_values(value: Any) -> list[str]: + """Collect nested string values that must not be exposed in project output.""" + if isinstance(value, str): + return [value] if value else [] + if isinstance(value, dict): + strings = [] + for nested in value.values(): + strings.extend(_collect_string_values(nested)) + return strings + if isinstance(value, list): + strings = [] + for nested in value: + strings.extend(_collect_string_values(nested)) + return strings + return [] + + +def _contains_sensitive_key(value: Any) -> bool: + """Return whether nested data uses a key commonly associated with credentials.""" + if isinstance(value, dict): + return any( + SENSITIVE_KEY_PATTERN.search(str(key)) or _contains_sensitive_key(nested) for key, nested in value.items() + ) + if isinstance(value, list): + return any(_contains_sensitive_key(nested) for nested in value) + return False + + +def sanitize_project_data(value: Any, secrets: Iterable[str]) -> Any: + """Recursively redact credentials and workload-provided secret values.""" + secret_values = [secret for secret in secrets if secret] + if isinstance(value, str): + result = value + for secret in secret_values: + result = 
result.replace(secret, "[REDACTED]") + return redact_sensitive(result) + if isinstance(value, dict): + return {key: sanitize_project_data(nested, secret_values) for key, nested in value.items()} + if isinstance(value, list): + return [sanitize_project_data(nested, secret_values) for nested in value] + return value + + +def _unwrap_response(data: Any) -> Any: + if isinstance(data, dict) and data.get("ok") is True and "data" in data: + return data["data"] + return data + + +def _error_detail(data: Any, status: int) -> tuple[str, str]: + if isinstance(data, dict): + nested = data.get("error") + if isinstance(nested, dict): + return str(nested.get("code") or "API_ERROR"), str(nested.get("message") or f"HTTP {status}") + return str(data.get("code") or "API_ERROR"), str(data.get("message") or f"HTTP {status}") + return "API_ERROR", f"HTTP {status}" + + +class JungleGridClient: + """Small async client for Jungle Grid's documented public execution API.""" + + def __init__( + self, + api_base: Optional[str] = None, + timeout_seconds: float = 30.0, + ): + raw_api_base = api_base if api_base is not None else os.getenv("JUNGLEGRID_API_BASE", DEFAULT_API_BASE) + self.api_key = os.getenv("JUNGLE_GRID_API_KEY", "").strip() + self.api_base = raw_api_base.rstrip("/") + self.timeout_seconds = timeout_seconds + + def _require_api_key(self) -> str: + if not self.api_key: + raise JungleGridError("MISSING_API_KEY", "JUNGLE_GRID_API_KEY is required.") + return self.api_key + + async def _request(self, method: str, path: str, payload: Optional[Dict[str, Any]] = None) -> Dict[str, Any]: + api_key = self._require_api_key() + timeout = aiohttp.ClientTimeout(total=self.timeout_seconds) + headers = { + "Accept": "application/json", + "Authorization": f"Bearer {api_key}", + "Content-Type": "application/json", + } + try: + async with aiohttp.ClientSession(timeout=timeout) as session: + async with session.request(method, f"{self.api_base}{path}", headers=headers, json=payload) as response: + text 
= await response.text() + try: + data = json.loads(text) if text.strip() else {} + except json.JSONDecodeError as exc: + raise JungleGridError( + "INVALID_API_RESPONSE", "Jungle Grid returned invalid JSON.", response.status + ) from exc + if response.status < 200 or response.status >= 300: + code, message = _error_detail(data, response.status) + raise JungleGridError(code, redact_sensitive(message, api_key), response.status) + except asyncio.TimeoutError as exc: + raise JungleGridError("NETWORK_TIMEOUT", "Jungle Grid request timed out.") from exc + except aiohttp.ClientError as exc: + raise JungleGridError("NETWORK_ERROR", redact_sensitive(exc, api_key)) from exc + + result = _unwrap_response(data) + if not isinstance(result, dict): + raise JungleGridError("INVALID_API_RESPONSE", "Jungle Grid returned an unexpected response shape.") + return result + + async def estimate_job(self, workload: Dict[str, Any]) -> Dict[str, Any]: + return await self._request("POST", "/v1/jobs/estimate", workload) + + async def submit_job(self, workload: Dict[str, Any]) -> Dict[str, Any]: + return await self._request("POST", "/v1/jobs", workload) + + async def get_job(self, job_id: str) -> Dict[str, Any]: + return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}") + + async def get_job_logs(self, job_id: str) -> Dict[str, Any]: + return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/logs") + + async def cancel_job(self, job_id: str, reason: str) -> Dict[str, Any]: + return await self._request("POST", f"/v1/jobs/{quote(job_id, safe='')}/cancel", {"reason": reason}) + + async def list_artifacts(self, job_id: str) -> Dict[str, Any]: + return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/artifacts") + + async def get_artifact(self, job_id: str, artifact_id: str) -> Dict[str, Any]: + return await self._request( + "POST", + f"/v1/jobs/{quote(job_id, safe='')}/artifacts/{quote(artifact_id, safe='')}/download", + ) + + +def parse_workload_goal(goal: 
str) -> Dict[str, Any]: + """Parse and validate a project goal containing a Jungle Grid workload JSON object.""" + text = goal.strip() + if text.startswith("```"): + text = re.sub(r"^```(?:json)?\s*", "", text) + text = re.sub(r"\s*```$", "", text) + try: + workload = json.loads(text) + except json.JSONDecodeError as exc: + raise ValueError("Project goal must be a JSON object describing the Jungle Grid workload.") from exc + if not isinstance(workload, dict): + raise ValueError("Project goal must be a JSON object.") + if SENSITIVE_PATTERN.search(json.dumps(workload)): + raise ValueError("Workload must not contain API keys or Bearer tokens.") + + unsupported = sorted(set(workload) - SUBMIT_FIELDS) + if unsupported: + raise ValueError(f"Unsupported workload fields: {', '.join(unsupported)}.") + required = {"name", "workload_type", "image"} + missing = sorted(key for key in required if not isinstance(workload.get(key), str) or not workload[key].strip()) + if missing: + raise ValueError(f"Missing required workload fields: {', '.join(missing)}.") + if workload["workload_type"] not in VALID_WORKLOAD_TYPES: + raise ValueError(f"workload_type must be one of: {', '.join(sorted(VALID_WORKLOAD_TYPES))}.") + if "optimize_for" in workload and workload["optimize_for"] not in VALID_OPTIMIZE_FOR: + raise ValueError(f"optimize_for must be one of: {', '.join(sorted(VALID_OPTIMIZE_FOR))}.") + if "args" in workload and not ( + isinstance(workload["args"], list) and all(isinstance(item, str) for item in workload["args"]) + ): + raise ValueError("args must be an array of strings.") + if "environment_from_env" in workload and not ( + isinstance(workload["environment_from_env"], dict) + and all( + isinstance(key, str) and key.strip() and isinstance(value, str) and value.strip() + for key, value in workload["environment_from_env"].items() + ) + ): + raise ValueError("environment_from_env must map workload variable names to local environment variable names.") + if 
_contains_sensitive_key(workload.get("metadata")): + raise ValueError("metadata must not contain secret-like keys.") + return workload + + +def build_estimate_payload(workload: Dict[str, Any]) -> Dict[str, Any]: + """Build an estimate-only payload without submit-only or secret-bearing fields.""" + return {key: value for key, value in workload.items() if key in ESTIMATE_FIELDS} + + +def build_submit_payload(workload: Dict[str, Any]) -> Dict[str, Any]: + """Build a submit payload, resolving secret environment values only at submission time.""" + payload = {key: value for key, value in workload.items() if key != "environment_from_env"} + references = workload.get("environment_from_env") + if not references: + return payload + + missing = sorted(env_name for env_name in references.values() if not os.getenv(env_name)) + if missing: + raise ValueError(f"Missing required local environment variables: {', '.join(missing)}.") + payload["environment"] = {name: os.environ[env_name] for name, env_name in references.items()} + return payload + + +def public_workload(workload: Dict[str, Any]) -> Dict[str, Any]: + """Return workload metadata safe to share in a project message or artifact.""" + result = dict(workload) + if "metadata" in result: + metadata = result["metadata"] + result["metadata"] = {key: "[REDACTED]" for key in metadata} if isinstance(metadata, dict) else "[REDACTED]" + return result + + +def lifecycle_label(status: str) -> str: + """Map Jungle Grid status to a user-facing lifecycle label.""" + if status == "assigned": + return "assigned (provisioning)" + return status + + +def estimate_can_submit(estimate: Dict[str, Any]) -> bool: + """Return whether an estimate explicitly permits submission.""" + return estimate.get("available") is not False and estimate.get("can_submit") is not False + + +@dataclass +class ProjectExecution: + """State tracked between estimate, approval, submission, and completion.""" + + project_id: str + workload: Dict[str, Any] + estimate_id: 
str + estimate: Dict[str, Any] + job_id: Optional[str] = None + last_status: Optional[str] = None + approved_by: Optional[str] = None + submission_started: bool = False + submit_payload: Optional[Dict[str, Any]] = None + secret_values: Optional[list[str]] = None + + +class JungleGridExecutorAgent(WorkerAgent): + """Execute approved Jungle Grid workloads and report results to an OpenAgents project.""" + + default_agent_id = "jungle-grid-executor" + + def __init__( + self, + jungle_grid_client: Optional[JungleGridClient] = None, + poll_interval_seconds: float = 10.0, + **kwargs: Any, + ): + super().__init__(**kwargs) + self.jungle_grid = jungle_grid_client or JungleGridClient() + self.poll_interval_seconds = poll_interval_seconds + self.project_adapter = DefaultProjectAgentAdapter() + self.executions: Dict[str, ProjectExecution] = {} + self.monitor_tasks: Dict[str, asyncio.Task] = {} + + async def on_startup(self): + """Bind the project adapter after the OpenAgents client is connected.""" + self.project_adapter.bind_client(self.client) + self.project_adapter.bind_connector(self.client.connector) + self.project_adapter.bind_agent(self.agent_id) + logger.info("Jungle Grid executor is ready") + + async def on_shutdown(self): + """Stop local monitor tasks without cancelling remote jobs.""" + for task in self.monitor_tasks.values(): + task.cancel() + if self.monitor_tasks: + await asyncio.gather(*self.monitor_tasks.values(), return_exceptions=True) + + async def _post(self, project_id: str, text: str): + await self.project_adapter.send_project_message(project_id=project_id, content={"text": text}) + + async def _set_artifact(self, project_id: str, key: str, value: Dict[str, Any]): + await self.project_adapter.set_project_artifact( + project_id=project_id, key=key, value=json.dumps(value, indent=2) + ) + + def _project_secrets(self, execution: ProjectExecution) -> list[str]: + return [ + self.jungle_grid.api_key, + *(execution.secret_values or []), + 
*_collect_string_values(execution.workload.get("metadata")), + ] + + def _sanitize_for_project(self, value: Any, execution: ProjectExecution) -> Any: + return sanitize_project_data(value, self._project_secrets(execution)) + + def _is_human_approver(self, sender_id: str) -> bool: + return sender_id.startswith("human:") + + @on_event("project.notification.started") + async def handle_project_started(self, context: EventContext): + """Estimate a workload and request human approval without submitting it.""" + payload = context.incoming_event.payload + project_id = payload.get("project_id") + goal = payload.get("goal", "") + if not project_id: + return + try: + workload = parse_workload_goal(goal) + estimate = await self.jungle_grid.estimate_job(build_estimate_payload(workload)) + estimate_id = uuid.uuid4().hex[:12] + execution = ProjectExecution(project_id, workload, estimate_id, estimate) + self.executions[project_id] = execution + shared_workload = self._sanitize_for_project(public_workload(workload), execution) + shared_estimate = self._sanitize_for_project(estimate, execution) + await self._set_artifact( + project_id, + "jungle_grid_estimate", + {"estimate_id": estimate_id, "workload": shared_workload, "estimate": shared_estimate}, + ) + if not estimate_can_submit(estimate): + await self._post( + project_id, + "Jungle Grid estimate is not currently eligible for submission.\n\n" + f"```json\n{json.dumps({'estimate_id': estimate_id, 'workload': shared_workload, 'estimate': shared_estimate}, indent=2)}\n```", + ) + await self.project_adapter.stop_project( + project_id=project_id, reason="Jungle Grid estimate is not eligible for submission" + ) + return + await self._post( + project_id, + "Jungle Grid estimate ready. 
No job has been submitted.\n\n" + f"```json\n{json.dumps({'estimate_id': estimate_id, 'workload': shared_workload, 'estimate': shared_estimate}, indent=2)}\n```\n\n" + f"A human must reply exactly `APPROVE {estimate_id}` before billable compute can start.", + ) + except (ValueError, JungleGridError) as exc: + await self._post( + project_id, f"Jungle Grid estimate failed: {redact_sensitive(exc, self.jungle_grid.api_key)}" + ) + await self.project_adapter.stop_project(project_id=project_id, reason="Jungle Grid estimate failed") + + @on_event("project.notification.message_received") + async def handle_project_message(self, context: EventContext): + """Handle explicit approval and cancellation commands.""" + payload = context.incoming_event.payload + project_id = payload.get("project_id") + sender_id = str(payload.get("sender_id", "")) + content = payload.get("content", {}) + text = content.get("text", "") if isinstance(content, dict) else "" + if not project_id or not isinstance(text, str): + return + command = text + execution = self.executions.get(project_id) + + if command.startswith("APPROVE "): + if not execution: + await self._post(project_id, "There is no pending Jungle Grid estimate for this project.") + return + if not self._is_human_approver(sender_id): + await self._post( + project_id, "Approval rejected: billable Jungle Grid submission requires a human approver." 
+ ) + return + if command != f"APPROVE {execution.estimate_id}": + await self._post(project_id, "Approval rejected: estimate id does not match the pending estimate.") + return + if execution.submission_started: + suffix = f" as job `{execution.job_id}`" if execution.job_id else "" + await self._post(project_id, f"Jungle Grid submission has already been requested{suffix}.") + return + await self._submit_and_monitor(execution, sender_id) + return + + if command.startswith("CANCEL "): + if not execution or not execution.job_id: + await self._post(project_id, "There is no submitted Jungle Grid job to cancel for this project.") + return + if command != f"CANCEL {execution.job_id}": + await self._post(project_id, "Cancellation rejected: job id does not match this project.") + return + if not self._is_human_approver(sender_id): + await self._post( + project_id, "Cancellation rejected: Jungle Grid cancellation requires a human approver." + ) + return + try: + result = await self.jungle_grid.cancel_job( + execution.job_id, f"Requested from OpenAgents by {sender_id}" + ) + shared_result = self._sanitize_for_project(result, execution) + await self._post( + project_id, + f"Cancellation requested for Jungle Grid job `{execution.job_id}`.\n\n```json\n{json.dumps(shared_result, indent=2)}\n```", + ) + except JungleGridError as exc: + await self._post( + project_id, f"Jungle Grid cancellation failed: {redact_sensitive(exc, self.jungle_grid.api_key)}" + ) + + async def _submit_and_monitor(self, execution: ProjectExecution, approved_by: str): + execution.submission_started = True + execution.approved_by = approved_by + try: + execution.submit_payload = build_submit_payload(execution.workload) + execution.secret_values = _collect_string_values(execution.submit_payload.get("environment")) + result = await self.jungle_grid.submit_job(execution.submit_payload) + job_id = str(result.get("job_id") or result.get("id") or "").strip() + if not job_id: + raise 
JungleGridError("INVALID_API_RESPONSE", "Jungle Grid submit response did not include a job id.") + execution.job_id = job_id + execution.last_status = str(result.get("status") or "submitted") + await self._set_artifact( + execution.project_id, + "jungle_grid_submission", + { + "approved_by": approved_by, + "estimate_id": execution.estimate_id, + "submission": self._sanitize_for_project(result, execution), + }, + ) + await self._post( + execution.project_id, + f"Jungle Grid job submitted after approval by `{approved_by}`: `{job_id}` " + f"(status: `{lifecycle_label(execution.last_status)}`).", + ) + task = asyncio.create_task(self._monitor(execution)) + self.monitor_tasks[execution.project_id] = task + except (ValueError, JungleGridError) as exc: + await self._post( + execution.project_id, + f"Jungle Grid submission failed: {redact_sensitive(exc, self.jungle_grid.api_key)}", + ) + await self.project_adapter.stop_project( + project_id=execution.project_id, reason="Jungle Grid submission failed" + ) + + async def _monitor(self, execution: ProjectExecution): + assert execution.job_id + try: + while True: + job = await self.jungle_grid.get_job(execution.job_id) + status = str(job.get("status") or "unknown") + if status != execution.last_status: + execution.last_status = status + await self._post( + execution.project_id, + f"Jungle Grid job `{execution.job_id}` is now `{lifecycle_label(status)}`.", + ) + if status in TERMINAL_STATUSES: + await self._finalize(execution, job) + return + await asyncio.sleep(self.poll_interval_seconds) + except JungleGridError as exc: + await self._post( + execution.project_id, + f"Jungle Grid monitoring failed: {redact_sensitive(exc, self.jungle_grid.api_key)}", + ) + await self.project_adapter.stop_project( + project_id=execution.project_id, reason="Jungle Grid monitoring failed" + ) + finally: + self.monitor_tasks.pop(execution.project_id, None) + + async def _finalize(self, execution: ProjectExecution, job: Dict[str, Any]): + assert 
execution.job_id + logs: Dict[str, Any] = {} + artifacts: Dict[str, Any] = {} + downloads = [] + try: + logs = await self.jungle_grid.get_job_logs(execution.job_id) + except JungleGridError as exc: + logs = {"error": redact_sensitive(exc, self.jungle_grid.api_key)} + try: + artifacts = await self.jungle_grid.list_artifacts(execution.job_id) + for artifact in artifacts.get("artifacts", []): + if not isinstance(artifact, dict): + continue + artifact_id = str(artifact.get("artifact_id") or artifact.get("id") or "").strip() + if artifact_id: + downloads.append(await self.jungle_grid.get_artifact(execution.job_id, artifact_id)) + except JungleGridError as exc: + artifacts = {"error": redact_sensitive(exc, self.jungle_grid.api_key)} + + result = self._sanitize_for_project( + {"job": job, "logs": logs, "artifacts": artifacts, "downloads": downloads}, + execution, + ) + await self._set_artifact(execution.project_id, "jungle_grid_result", result) + status = str(job.get("status") or "unknown") + await self._post( + execution.project_id, + f"Jungle Grid job `{execution.job_id}` finished with status `{status}`. 
" + "Logs and artifact metadata are stored in project artifact `jungle_grid_result`.", + ) + if status == "completed": + await self.project_adapter.complete_project( + project_id=execution.project_id, + summary=f"Jungle Grid job {execution.job_id} completed successfully.", + ) + else: + await self.project_adapter.stop_project( + project_id=execution.project_id, + reason=f"Jungle Grid job {execution.job_id} finished with status {status}.", + ) + + +async def main(): + """Run the Jungle Grid executor agent.""" + logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s") + agent = JungleGridExecutorAgent() + try: + await agent.async_start(network_host="localhost", network_port=8700) + while True: + await asyncio.sleep(3600) + finally: + await agent.async_stop() + + +if __name__ == "__main__": + asyncio.run(main()) diff --git a/sdk/demos/09_jungle_grid_gpu_execution/network.yaml b/sdk/demos/09_jungle_grid_gpu_execution/network.yaml new file mode 100644 index 000000000..dd935306c --- /dev/null +++ b/sdk/demos/09_jungle_grid_gpu_execution/network.yaml @@ -0,0 +1,89 @@ +network: + name: JungleGridGPUExecution + mode: centralized + node_id: jungle-grid-gpu-execution-1 + initialized: true + transports: + - type: http + config: + port: 8700 + serve_studio: true + serve_mcp: true + - type: grpc + config: + port: 8600 + manifest_transport: http + recommended_transport: grpc + encryption_enabled: false + default_agent_group: guest + requires_password: false + agent_groups: + executors: + description: Agents allowed to execute Jungle Grid project workflows + metadata: + permissions: + - execute_external_compute + agents: + - jungle-grid-executor + mods: + - name: openagents.mods.workspace.default + enabled: true + config: + custom_events_enabled: true + - name: openagents.mods.workspace.project + enabled: true + config: + max_concurrent_projects: 5 + project_templates: + jungle_grid_execution: + name: Jungle Grid GPU Execution + 
description: Estimate, approve, execute, and monitor an AI workload on Jungle Grid + expose_as_tool: true + tool_name: run_jungle_grid_workload + tool_description: Start a Jungle Grid workload project. The task must be a JSON object with name, workload_type, and image; use environment_from_env for workload environment values. + tool_mode: async + agent_groups: + - executors + context: | + This project delegates a long-running AI or GPU workload to Jungle Grid. + The executor estimates cost first and will not submit a job until a human + replies with the exact approval command shown in the project. Do not put + credentials in the goal; use environment_from_env to reference variables + available only in the executor process. + created_by_version: 0.9.3 + +network_profile: + discoverable: true + name: Jungle Grid GPU Execution + description: A demo of human-approved asynchronous AI and GPU workload delegation through Jungle Grid. + tags: + - demo + - jungle-grid + - gpu + - execution + - project + categories: + - demo + - workflow + country: Worldwide + required_openagents_version: 0.9.3 + capacity: 10 + authentication: + type: none + host: 0.0.0.0 + port: 8700 + +log_level: INFO +data_dir: ./data/jungle-grid-gpu-execution +runtime_limit: null +shutdown_timeout: 30 + +external_access: + default_agent_group: guest + auth_token: null + auth_token_env: null + instruction: null + exposed_tools: + - start_run_jungle_grid_workload + - get_result_run_jungle_grid_workload + excluded_tools: [] diff --git a/tests/agents/test_jungle_grid_executor.py b/tests/agents/test_jungle_grid_executor.py new file mode 100644 index 000000000..288cc8fce --- /dev/null +++ b/tests/agents/test_jungle_grid_executor.py @@ -0,0 +1,482 @@ +"""Mocked tests for the Jungle Grid GPU execution demo agent.""" + +import asyncio +import importlib.util +import json +from pathlib import Path +from unittest.mock import AsyncMock + +import pytest + +from openagents.models.event import Event +from 
openagents.models.event_context import EventContext + +MODULE_PATH = ( + Path(__file__).parent.parent.parent + / "sdk" + / "demos" + / "09_jungle_grid_gpu_execution" + / "agents" + / "jungle_grid_executor.py" +) +SPEC = importlib.util.spec_from_file_location("jungle_grid_executor", MODULE_PATH) +MODULE = importlib.util.module_from_spec(SPEC) +assert SPEC and SPEC.loader +SPEC.loader.exec_module(MODULE) + +JungleGridClient = MODULE.JungleGridClient +JungleGridError = MODULE.JungleGridError +JungleGridExecutorAgent = MODULE.JungleGridExecutorAgent +ProjectExecution = MODULE.ProjectExecution +build_estimate_payload = MODULE.build_estimate_payload +build_submit_payload = MODULE.build_submit_payload +estimate_can_submit = MODULE.estimate_can_submit +lifecycle_label = MODULE.lifecycle_label +parse_workload_goal = MODULE.parse_workload_goal +public_workload = MODULE.public_workload +redact_sensitive = MODULE.redact_sensitive +sanitize_project_data = MODULE.sanitize_project_data + + +def context(event_name, payload): + return EventContext( + incoming_event=Event(event_name=event_name, source_id="system", payload=payload), + event_threads={}, + incoming_thread_id="thread-1", + ) + + +def workload(): + return { + "name": "batch-demo", + "workload_type": "batch", + "image": "python:3.11-slim", + "command": "python", + "args": ["-c", "print(42)"], + "optimize_for": "cost", + } + + +class FakeJungleGridClient: + def __init__(self): + self.api_key = "test-api-key" + self.estimate_job = AsyncMock(return_value={"available": True, "estimated_cost_usd": {"min": 0.1, "max": 0.2}}) + self.submit_job = AsyncMock(return_value={"job_id": "job_123", "status": "queued"}) + self.get_job = AsyncMock(return_value={"job_id": "job_123", "status": "completed"}) + self.get_job_logs = AsyncMock(return_value={"items": [{"message": "done"}]}) + self.cancel_job = AsyncMock(return_value={"job_id": "job_123", "status": "cancelled", "cancelled": True}) + self.list_artifacts = AsyncMock( + 
return_value={"artifacts": [{"artifact_id": "artifact_1", "filename": "output.json"}]} + ) + self.get_artifact = AsyncMock( + return_value={ + "artifact": {"artifact_id": "artifact_1", "filename": "output.json"}, + "url": "https://example.test/file", + } + ) + + +def agent_with_mocks(fake=None): + agent = JungleGridExecutorAgent(jungle_grid_client=fake or FakeJungleGridClient(), poll_interval_seconds=0) + agent.project_adapter = AsyncMock() + agent.project_adapter.send_project_message = AsyncMock(return_value={"success": True}) + agent.project_adapter.set_project_artifact = AsyncMock(return_value={"success": True}) + agent.project_adapter.complete_project = AsyncMock(return_value={"success": True}) + agent.project_adapter.stop_project = AsyncMock(return_value={"success": True}) + return agent + + +@pytest.mark.asyncio +async def test_successful_estimate_flow_posts_estimate_and_requires_approval(): + fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + + await agent.handle_project_started( + context("project.notification.started", {"project_id": "project-1", "goal": json.dumps(workload())}) + ) + + fake.estimate_job.assert_awaited_once_with(build_estimate_payload(workload())) + fake.submit_job.assert_not_awaited() + assert "project-1" in agent.executions + message = agent.project_adapter.send_project_message.await_args.kwargs["content"]["text"] + assert "No job has been submitted" in message + assert "APPROVE" in message + + +@pytest.mark.asyncio +async def test_unavailable_estimate_never_requests_approval_or_submits(): + fake = FakeJungleGridClient() + fake.estimate_job = AsyncMock(return_value={"available": False, "can_submit": False}) + agent = agent_with_mocks(fake) + + await agent.handle_project_started( + context("project.notification.started", {"project_id": "project-1", "goal": json.dumps(workload())}) + ) + + fake.submit_job.assert_not_awaited() + message = agent.project_adapter.send_project_message.await_args.kwargs["content"]["text"] + assert 
"not currently eligible for submission" in message + assert "APPROVE" not in message + agent.project_adapter.stop_project.assert_awaited_once() + + +@pytest.mark.asyncio +async def test_approval_required_before_submit_and_non_human_is_rejected(): + fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + execution = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) + agent.executions["project-1"] = execution + + await agent.handle_project_message( + context( + "project.notification.message_received", + {"project_id": "project-1", "sender_id": "agent:other", "content": {"text": "APPROVE estimate-1"}}, + ) + ) + + fake.submit_job.assert_not_awaited() + assert ( + "requires a human approver" in agent.project_adapter.send_project_message.await_args.kwargs["content"]["text"] + ) + + +@pytest.mark.asyncio +@pytest.mark.parametrize("command", ["APPROVE estimate-2", " APPROVE estimate-1", "APPROVE estimate-1\n"]) +async def test_approval_requires_exact_command(command): + fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + execution = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) + agent.executions["project-1"] = execution + + await agent.handle_project_message( + context( + "project.notification.message_received", + {"project_id": "project-1", "sender_id": "human:user", "content": {"text": command}}, + ) + ) + + fake.submit_job.assert_not_awaited() + + +@pytest.mark.asyncio +async def test_approved_submit_flow_starts_monitor(): + fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + execution = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) + agent.executions["project-1"] = execution + agent._monitor = AsyncMock() + + await agent.handle_project_message( + context( + "project.notification.message_received", + {"project_id": "project-1", "sender_id": "human:user", "content": {"text": "APPROVE estimate-1"}}, + ) + ) + await asyncio.sleep(0) + + 
fake.submit_job.assert_awaited_once_with(workload()) + assert execution.job_id == "job_123" + agent._monitor.assert_awaited_once_with(execution) + + +@pytest.mark.asyncio +async def test_concurrent_matching_approvals_submit_only_once(): + fake = FakeJungleGridClient() + submit_started = asyncio.Event() + release_submit = asyncio.Event() + + async def delayed_submit(_workload): + submit_started.set() + await release_submit.wait() + return {"job_id": "job_123", "status": "queued"} + + fake.submit_job = AsyncMock(side_effect=delayed_submit) + agent = agent_with_mocks(fake) + agent._monitor = AsyncMock() + execution = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) + agent.executions["project-1"] = execution + approval = context( + "project.notification.message_received", + {"project_id": "project-1", "sender_id": "human:user", "content": {"text": "APPROVE estimate-1"}}, + ) + + first = asyncio.create_task(agent.handle_project_message(approval)) + await submit_started.wait() + await agent.handle_project_message(approval) + release_submit.set() + await first + await asyncio.sleep(0) + + fake.submit_job.assert_awaited_once_with(workload()) + + +@pytest.mark.asyncio +async def test_status_polling_posts_updates_and_completes(): + fake = FakeJungleGridClient() + fake.get_job = AsyncMock( + side_effect=[ + {"job_id": "job_123", "status": "running"}, + {"job_id": "job_123", "status": "completed"}, + ] + ) + agent = agent_with_mocks(fake) + execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123", last_status="queued") + + await agent._monitor(execution) + + texts = [call.kwargs["content"]["text"] for call in agent.project_adapter.send_project_message.await_args_list] + assert any("`running`" in text for text in texts) + assert any("`completed`" in text for text in texts) + agent.project_adapter.complete_project.assert_awaited_once() + + +@pytest.mark.asyncio +async def test_failed_workload_stops_project(): + fake = 
FakeJungleGridClient() + agent = agent_with_mocks(fake) + execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + + await agent._finalize(execution, {"job_id": "job_123", "status": "failed"}) + + agent.project_adapter.stop_project.assert_awaited_once() + agent.project_adapter.complete_project.assert_not_awaited() + + +@pytest.mark.asyncio +async def test_logs_and_artifacts_are_stored_in_project_artifact(): + fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + + await agent._finalize(execution, {"job_id": "job_123", "status": "completed"}) + + fake.get_job_logs.assert_awaited_once_with("job_123") + fake.list_artifacts.assert_awaited_once_with("job_123") + fake.get_artifact.assert_awaited_once_with("job_123", "artifact_1") + artifact_call = agent.project_adapter.set_project_artifact.await_args + assert artifact_call.kwargs["key"] == "jungle_grid_result" + assert "output.json" in artifact_call.kwargs["value"] + + +@pytest.mark.asyncio +async def test_resolved_environment_values_are_redacted_from_results(monkeypatch): + monkeypatch.setenv("MODEL_TOKEN", "secret-value") + fake = FakeJungleGridClient() + fake.get_job_logs = AsyncMock(return_value={"items": [{"message": "token=secret-value"}]}) + agent = agent_with_mocks(fake) + requested = {**workload(), "environment_from_env": {"MODEL_TOKEN": "MODEL_TOKEN"}} + execution = ProjectExecution( + "project-1", + requested, + "estimate-1", + {}, + job_id="job_123", + submit_payload=build_submit_payload(requested), + secret_values=["secret-value"], + ) + + await agent._finalize(execution, {"job_id": "job_123", "status": "completed"}) + + artifact_value = agent.project_adapter.set_project_artifact.await_args.kwargs["value"] + assert "secret-value" not in artifact_value + assert "[REDACTED]" in artifact_value + + +@pytest.mark.asyncio +async def test_cancellation_uses_matching_job_id(): + 
fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + + await agent.handle_project_message( + context( + "project.notification.message_received", + {"project_id": "project-1", "sender_id": "human:user", "content": {"text": "CANCEL job_123"}}, + ) + ) + + fake.cancel_job.assert_awaited_once_with("job_123", "Requested from OpenAgents by human:user") + + +@pytest.mark.asyncio +async def test_non_human_cancellation_is_rejected(): + fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + + await agent.handle_project_message( + context( + "project.notification.message_received", + {"project_id": "project-1", "sender_id": "agent:other", "content": {"text": "CANCEL job_123"}}, + ) + ) + + fake.cancel_job.assert_not_awaited() + + +@pytest.mark.asyncio +@pytest.mark.parametrize("command", ["CANCEL job_456", " CANCEL job_123", "CANCEL job_123\n"]) +async def test_cancellation_requires_exact_command(command): + fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + + await agent.handle_project_message( + context( + "project.notification.message_received", + {"project_id": "project-1", "sender_id": "human:user", "content": {"text": command}}, + ) + ) + + fake.cancel_job.assert_not_awaited() + + +@pytest.mark.asyncio +async def test_missing_api_key_is_reported_without_network_call(monkeypatch): + monkeypatch.delenv("JUNGLE_GRID_API_KEY", raising=False) + client = JungleGridClient() + with pytest.raises(JungleGridError, match="JUNGLE_GRID_API_KEY is required"): + await client.estimate_job(workload()) + + +def test_invalid_workload_is_rejected(): + with pytest.raises(ValueError, match="Missing required workload fields"): 
+ parse_workload_goal('{"workload_type": "batch"}') + + +def test_workload_rejects_literal_credentials_and_secret_like_metadata(): + with pytest.raises(ValueError, match="must not contain API keys"): + parse_workload_goal(json.dumps({**workload(), "command": "curl -H 'Bearer secret-value'"})) + with pytest.raises(ValueError, match="secret-like keys"): + parse_workload_goal(json.dumps({**workload(), "metadata": {"api_token": "secret-value"}})) + + +def test_build_submit_payload_resolves_environment_only_at_submission(monkeypatch): + monkeypatch.setenv("MODEL_TOKEN", "secret-value") + requested = {**workload(), "environment_from_env": {"MODEL_TOKEN": "MODEL_TOKEN"}} + + assert "environment_from_env" not in build_estimate_payload(requested) + assert build_submit_payload(requested)["environment"] == {"MODEL_TOKEN": "secret-value"} + assert public_workload(requested)["environment_from_env"] == {"MODEL_TOKEN": "MODEL_TOKEN"} + + +def test_build_submit_payload_rejects_missing_local_environment(monkeypatch): + monkeypatch.delenv("MISSING_MODEL_TOKEN", raising=False) + requested = {**workload(), "environment_from_env": {"MODEL_TOKEN": "MISSING_MODEL_TOKEN"}} + + with pytest.raises(ValueError, match="MISSING_MODEL_TOKEN"): + build_submit_payload(requested) + + +def test_secret_redaction_removes_api_keys_and_bearer_tokens(): + text = redact_sensitive("failed with Bearer abc123 and jg_super_secret", "jg_super_secret") + assert "abc123" not in text + assert "jg_super_secret" not in text + assert "[REDACTED]" in text + + +def test_public_workload_redacts_metadata_values(): + shared = public_workload({**workload(), "metadata": {"nested": {"value": "secret"}}}) + assert shared["metadata"] == {"nested": "[REDACTED]"} + assert "secret" not in json.dumps(shared) + + +def test_project_data_redaction_removes_nested_workload_secrets(): + result = sanitize_project_data( + {"logs": [{"message": "token=secret-value"}], "error": "Bearer test-api-key"}, + ["secret-value", "test-api-key"], + 
) + assert "secret-value" not in json.dumps(result) + assert "test-api-key" not in json.dumps(result) + + +def test_estimate_can_submit_honors_explicit_unavailability(): + assert estimate_can_submit({"available": True, "can_submit": True}) + assert not estimate_can_submit({"available": False}) + assert not estimate_can_submit({"can_submit": False}) + + +@pytest.mark.parametrize( + ("status", "label"), + [ + ("submitted", "submitted"), + ("queued", "queued"), + ("assigned", "assigned (provisioning)"), + ("running", "running"), + ("completed", "completed"), + ("failed", "failed"), + ("rejected", "rejected"), + ("cancelled", "cancelled"), + ], +) +def test_lifecycle_labels(status, label): + assert lifecycle_label(status) == label + + +class FakeResponse: + def __init__(self, status, text): + self.status = status + self._text = text + + async def text(self): + return self._text + + async def __aenter__(self): + return self + + async def __aexit__(self, exc_type, exc, tb): + return None + + +class FakeSession: + def __init__(self, response=None, error=None, **kwargs): + self.response = response + self.error = error + + def request(self, *args, **kwargs): + if self.error: + raise self.error + return self.response + + async def __aenter__(self): + return self + + async def __aexit__(self, exc_type, exc, tb): + return None + + +@pytest.mark.asyncio +async def test_invalid_jungle_grid_response(monkeypatch): + monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key") + monkeypatch.setattr(MODULE.aiohttp, "ClientSession", lambda **kwargs: FakeSession(FakeResponse(200, "not-json"))) + client = JungleGridClient() + + with pytest.raises(JungleGridError, match="invalid JSON"): + await client.get_job("job_123") + + +@pytest.mark.asyncio +async def test_network_timeout_is_sanitized(monkeypatch): + monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key") + monkeypatch.setattr( + MODULE.aiohttp, + "ClientSession", + lambda **kwargs: FakeSession(error=asyncio.TimeoutError()), + ) + 
client = JungleGridClient() + + with pytest.raises(JungleGridError, match="timed out"): + await client.get_job("job_123") + + +@pytest.mark.asyncio +async def test_api_error_is_sanitized(monkeypatch): + monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key") + body = json.dumps({"error": {"code": "FORBIDDEN", "message": "Bearer test-api-key is not allowed"}}) + monkeypatch.setattr(MODULE.aiohttp, "ClientSession", lambda **kwargs: FakeSession(FakeResponse(403, body))) + client = JungleGridClient() + + with pytest.raises(JungleGridError) as exc_info: + await client.get_job("job_123") + assert exc_info.value.code == "FORBIDDEN" + assert "test-api-key" not in str(exc_info.value) From 7d6c00d375c63986ac5ff8766000739831a9e279 Mon Sep 17 00:00:00 2001 From: dejaguarkyng Date: Tue, 9 Jun 2026 09:48:52 +0000 Subject: [PATCH 2/5] fix: update jungle grid executor group auth --- .../IMPLEMENTATION_DECISION.md | 28 ++-- .../09_jungle_grid_gpu_execution/README.md | 54 +++++-- .../agents/jungle_grid_executor.py | 52 +++++-- .../09_jungle_grid_gpu_execution/network.yaml | 5 +- tests/agents/test_jungle_grid_executor.py | 144 +++++++++++++++++- 5 files changed, 250 insertions(+), 33 deletions(-) diff --git a/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md b/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md index 41375c1d0..05857178a 100644 --- a/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md +++ b/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md @@ -7,11 +7,11 @@ uses OpenAgents' project mod for the long-running workflow, project messages for estimate and lifecycle updates, and project artifacts for logs and Jungle Grid artifact metadata. -Jungle Grid is an external execution layer, not an OpenAgents transport, launcher -agent type, or network mod. 
A demo keeps the integration provider-specific while -showing a reusable OpenAgents pattern: an agent delegates asynchronous compute, -waits for human approval before billable work, and returns results to a shared -project. +Jungle Grid is an external agentic AI workload execution and GPU orchestration +layer, not an OpenAgents transport, launcher agent type, or network mod. A demo +keeps the integration provider-specific while showing a reusable OpenAgents +pattern: an agent delegates asynchronous compute, waits for human approval +before billable work, and returns results to a shared project. ## Rejected Alternatives @@ -21,10 +21,9 @@ project. provider-specific compute backend. - **Jungle Grid mod:** The integration does not add network-wide event semantics or shared infrastructure. Existing project events already cover the workflow. -- **Hosted MCP entry:** OpenAgents can load external MCP tools, but the current - Streamable HTTP MCP connector does not perform Jungle Grid's hosted OAuth flow or - attach API-key headers. Adding that capability solely for this demo would be a - core architecture change. +- **Hosted MCP entry:** Jungle Grid's hosted Streamable HTTP endpoint uses OAuth, + while local stdio uses an API key. The direct REST integration keeps approval + and project state inside OpenAgents without requiring an MCP auth change. - **Local stdio MCP dependency:** The Jungle Grid stdio MCP package is supported, but a direct Python API client is easier to validate, test, and constrain around mandatory human approval. It also avoids requiring Node.js for a Python demo. 
@@ -36,15 +35,24 @@ The demo uses the documented public execution API: - `POST /v1/jobs/estimate` - `POST /v1/jobs` - `GET /v1/jobs/{job_id}` +- `GET /v1/jobs/{job_id}/runtime` - `GET /v1/jobs/{job_id}/logs` - `POST /v1/jobs/{job_id}/cancel` - `GET /v1/jobs/{job_id}/artifacts` - `POST /v1/jobs/{job_id}/artifacts/{artifact_id}/download` -Authentication is a scoped server-side API key in `JUNGLE_GRID_API_KEY`. The +Authentication is a scoped server-side API key in `JUNGLE_GRID_API_KEY`; the +REST base can be overridden with `JUNGLE_GRID_API`. The documented lifecycle includes `pending`, `queued`, `assigned`, `running`, `completed`, `failed`, `rejected`, and `cancelled`. +The current REST request shape includes `model_size_gb`. Estimate responses +describe classification, routing, capacity, rates, cost ranges, queue waits, +start windows, warnings, and screening without starting compute. Managed +workloads can publish regular files from `/workspace/artifacts`; temporary +signed artifact download URLs are treated as secrets and are not stored in the +OpenAgents project. + Workload environment values are not accepted in project goals. A goal may use `environment_from_env` to reference variables available only in the executor process; those values are resolved after human approval and are excluded from diff --git a/sdk/demos/09_jungle_grid_gpu_execution/README.md b/sdk/demos/09_jungle_grid_gpu_execution/README.md index a9841b8df..599cf77ab 100644 --- a/sdk/demos/09_jungle_grid_gpu_execution/README.md +++ b/sdk/demos/09_jungle_grid_gpu_execution/README.md @@ -1,8 +1,9 @@ # Jungle Grid GPU Execution Demo This demo shows an OpenAgents execution agent delegating long-running AI and GPU -workloads to [Jungle Grid](https://junglegrid.dev), an execution layer that -places and runs AI workloads without requiring agents to manage GPU servers. 
+workloads to [Jungle Grid](https://junglegrid.dev), an agentic AI workload +execution and GPU orchestration layer that classifies intent, resolves capacity, +and places workloads without requiring agents to manage GPU servers. The workflow fits OpenAgents because the workload is asynchronous and collaborative: an agent estimates the job, a human approves spending in the @@ -30,7 +31,7 @@ approval, immediately before submission. - `JUNGLE_GRID_API_KEY` is required. The agent reads this server-side API key and sends it only as a Bearer token to Jungle Grid. -- `JUNGLEGRID_API_BASE` optionally overrides the default API base, +- `JUNGLE_GRID_API` optionally overrides the default REST API base, `https://api.junglegrid.dev`. - Any workload-specific variables referenced by `environment_from_env` must also be exported in the executor process. Their values are never placed in the @@ -54,6 +55,10 @@ export JUNGLE_GRID_API_KEY="jg_..." ## Run The Demo +The current demo assumes exactly one executor. Run one +`jungle-grid-executor` process so a project is estimated and submitted at most +once. + Start the OpenAgents network from this demo directory. The network enables the project mod and exposes the `Jungle Grid GPU Execution` project template: @@ -70,6 +75,13 @@ cd sdk/demos/09_jungle_grid_gpu_execution python agents/jungle_grid_executor.py ``` +The script connects with the password hash configured for the `executors` +group. OpenAgents records that connection in +`network.topology.agent_group_membership`, which is the runtime source used by +the project mod. The optional `metadata.agents` list in an agent-group +configuration does not assign runtime membership and is intentionally not used +by this demo. + Open Studio at `http://localhost:8700/studio`, create a project with the `Jungle Grid GPU Execution` template, and use a JSON object as the project goal. 
For example: @@ -79,14 +91,18 @@ For example: "name": "openagents-batch-demo", "workload_type": "batch", "image": "python:3.11-slim", + "model_size_gb": 1, "command": "python", "args": ["-c", "print('hello from Jungle Grid')"], "optimize_for": "cost" } ``` -The agent validates the request and calls Jungle Grid's estimate endpoint. It -posts the structured estimate and stores it as project artifact +The agent validates the request and calls the read-only +`POST /v1/jobs/estimate` endpoint. Current estimates include workload +classification, routing and capacity signals, hourly and total cost ranges, +queue-wait ranges, estimated start windows, warnings, and screening details. +The executor posts that structured estimate and stores it as project artifact `jungle_grid_estimate`. No compute has been submitted at this point. For a workload that needs a credential or other environment value, export it in @@ -101,6 +117,7 @@ export MODEL_TOKEN="..." "name": "openagents-inference-demo", "workload_type": "inference", "image": "example/model-server:latest", + "model_size_gb": 7, "environment_from_env": { "MODEL_TOKEN": "MODEL_TOKEN" }, @@ -120,10 +137,16 @@ the agent. Estimates that explicitly report `available: false` or APPROVE ``` -After approval, the agent submits the workload, posts status changes such as -submitted, queued, assigned/provisioning, running, completed, failed, rejected, -or cancelled, and stores the final job details, logs, artifact list, and -temporary download metadata in project artifact `jungle_grid_result`. +After approval, the agent submits with `POST /v1/jobs`, polls +`GET /v1/jobs/{job_id}`, and posts public lifecycle changes: pending, queued, +assigned, running, completed, failed, rejected, or cancelled. On a terminal +state it retrieves the runtime surface, the latest 100 stored log entries, and +the managed artifact list. Regular files written by managed workloads under +`/workspace/artifacts` are eligible for automatic upload. 
+ +Artifact download requests mint temporary signed URLs. The executor requests +download metadata but redacts the URL before storing `jungle_grid_result`; do +not log or share signed URLs. To cancel a submitted job, reply with the exact job ID: @@ -142,6 +165,19 @@ invalid Jungle Grid responses, and API errors are posted to the project in sanitized form. Failed, rejected, or cancelled jobs stop the OpenAgents project. Completed jobs complete the project. +The API key needs `jobs:estimate`, `jobs:submit`, `jobs:read`, and `logs:read` +capabilities for the complete flow. + +## Jungle Grid Interfaces + +This demo calls the REST API directly so OpenAgents can enforce project-based +human approval. Jungle Grid also provides the `jungle` CLI, whose `submit` +command estimates and asks for confirmation before queuing, and a hosted MCP +endpoint at `https://mcp.junglegrid.dev/mcp`. Hosted MCP uses OAuth; local stdio +MCP uses `JUNGLE_GRID_API_KEY`. The current MCP tools are `estimate_job`, +`submit_job`, `list_jobs`, `get_job`, `get_job_logs`, `cancel_job`, +`list_artifacts`, and `get_artifact`. + ## Tests Run the focused mocked tests. 
They do not contact Jungle Grid or submit paid diff --git a/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py b/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py index 6e8a369c2..23348120b 100644 --- a/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py +++ b/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py @@ -20,26 +20,32 @@ logger = logging.getLogger(__name__) DEFAULT_API_BASE = "https://api.junglegrid.dev" +EXECUTORS_GROUP_PASSWORD_HASH = "8fba13dab71d6fdd8a9b9db1f06e81315dfbfd69167b6097f724604db3c91cdf" TERMINAL_STATUSES = {"completed", "failed", "rejected", "cancelled"} -VALID_WORKLOAD_TYPES = {"inference", "training", "fine-tuning", "batch"} +VALID_WORKLOAD_TYPES = {"inference", "training", "batch"} VALID_OPTIMIZE_FOR = {"balanced", "cost", "speed"} SUBMIT_FIELDS = { "name", "workload_type", "image", + "model_size_gb", "command", "args", "environment_from_env", "optimize_for", + "constraints", "template", "metadata", } ESTIMATE_FIELDS = { + "name", "workload_type", "image", + "model_size_gb", "command", "args", "optimize_for", + "constraints", "template", } SENSITIVE_PATTERN = re.compile(r"(?i)(bearer\s+)[^\s,;]+|jg_[A-Za-z0-9_-]+") @@ -116,8 +122,14 @@ def _error_detail(data: Any, status: int) -> tuple[str, str]: if isinstance(data, dict): nested = data.get("error") if isinstance(nested, dict): - return str(nested.get("code") or "API_ERROR"), str(nested.get("message") or f"HTTP {status}") - return str(data.get("code") or "API_ERROR"), str(data.get("message") or f"HTTP {status}") + return ( + redact_sensitive(nested.get("code") or "API_ERROR"), + redact_sensitive(nested.get("message") or f"HTTP {status}"), + ) + return ( + redact_sensitive(data.get("code") or "API_ERROR"), + redact_sensitive(data.get("message") or f"HTTP {status}"), + ) return "API_ERROR", f"HTTP {status}" @@ -129,7 +141,7 @@ def __init__( api_base: Optional[str] = None, timeout_seconds: float = 30.0, ): - 
raw_api_base = api_base if api_base is not None else os.getenv("JUNGLEGRID_API_BASE", DEFAULT_API_BASE) + raw_api_base = api_base if api_base is not None else os.getenv("JUNGLE_GRID_API", DEFAULT_API_BASE) self.api_key = os.getenv("JUNGLE_GRID_API_KEY", "").strip() self.api_base = raw_api_base.rstrip("/") self.timeout_seconds = timeout_seconds @@ -159,7 +171,11 @@ async def _request(self, method: str, path: str, payload: Optional[Dict[str, Any ) from exc if response.status < 200 or response.status >= 300: code, message = _error_detail(data, response.status) - raise JungleGridError(code, redact_sensitive(message, api_key), response.status) + raise JungleGridError( + redact_sensitive(code, api_key), + redact_sensitive(message, api_key), + response.status, + ) except asyncio.TimeoutError as exc: raise JungleGridError("NETWORK_TIMEOUT", "Jungle Grid request timed out.") from exc except aiohttp.ClientError as exc: @@ -179,8 +195,11 @@ async def submit_job(self, workload: Dict[str, Any]) -> Dict[str, Any]: async def get_job(self, job_id: str) -> Dict[str, Any]: return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}") + async def get_job_runtime(self, job_id: str) -> Dict[str, Any]: + return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/runtime") + async def get_job_logs(self, job_id: str) -> Dict[str, Any]: - return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/logs") + return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/logs?tail=100") async def cancel_job(self, job_id: str, reason: str) -> Dict[str, Any]: return await self._request("POST", f"/v1/jobs/{quote(job_id, safe='')}/cancel", {"reason": reason}) @@ -217,6 +236,9 @@ def parse_workload_goal(goal: str) -> Dict[str, Any]: missing = sorted(key for key in required if not isinstance(workload.get(key), str) or not workload[key].strip()) if missing: raise ValueError(f"Missing required workload fields: {', '.join(missing)}.") + model_size_gb = 
workload.get("model_size_gb") + if not isinstance(model_size_gb, (int, float)) or isinstance(model_size_gb, bool) or model_size_gb <= 0: + raise ValueError("model_size_gb must be a positive number.") if workload["workload_type"] not in VALID_WORKLOAD_TYPES: raise ValueError(f"workload_type must be one of: {', '.join(sorted(VALID_WORKLOAD_TYPES))}.") if "optimize_for" in workload and workload["optimize_for"] not in VALID_OPTIMIZE_FOR: @@ -514,9 +536,14 @@ async def _monitor(self, execution: ProjectExecution): async def _finalize(self, execution: ProjectExecution, job: Dict[str, Any]): assert execution.job_id + runtime: Dict[str, Any] = {} logs: Dict[str, Any] = {} artifacts: Dict[str, Any] = {} downloads = [] + try: + runtime = await self.jungle_grid.get_job_runtime(execution.job_id) + except JungleGridError as exc: + runtime = {"error": redact_sensitive(exc, self.jungle_grid.api_key)} try: logs = await self.jungle_grid.get_job_logs(execution.job_id) except JungleGridError as exc: @@ -528,12 +555,15 @@ async def _finalize(self, execution: ProjectExecution, job: Dict[str, Any]): continue artifact_id = str(artifact.get("artifact_id") or artifact.get("id") or "").strip() if artifact_id: - downloads.append(await self.jungle_grid.get_artifact(execution.job_id, artifact_id)) + download = await self.jungle_grid.get_artifact(execution.job_id, artifact_id) + if "url" in download: + download = {**download, "url": "[REDACTED]"} + downloads.append(download) except JungleGridError as exc: artifacts = {"error": redact_sensitive(exc, self.jungle_grid.api_key)} result = self._sanitize_for_project( - {"job": job, "logs": logs, "artifacts": artifacts, "downloads": downloads}, + {"job": job, "runtime": runtime, "logs": logs, "artifacts": artifacts, "downloads": downloads}, execution, ) await self._set_artifact(execution.project_id, "jungle_grid_result", result) @@ -560,7 +590,11 @@ async def main(): logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s 
- %(message)s") agent = JungleGridExecutorAgent() try: - await agent.async_start(network_host="localhost", network_port=8700) + await agent.async_start( + network_host="localhost", + network_port=8700, + password_hash=EXECUTORS_GROUP_PASSWORD_HASH, + ) while True: await asyncio.sleep(3600) finally: diff --git a/sdk/demos/09_jungle_grid_gpu_execution/network.yaml b/sdk/demos/09_jungle_grid_gpu_execution/network.yaml index dd935306c..30c0f5012 100644 --- a/sdk/demos/09_jungle_grid_gpu_execution/network.yaml +++ b/sdk/demos/09_jungle_grid_gpu_execution/network.yaml @@ -20,11 +20,10 @@ network: agent_groups: executors: description: Agents allowed to execute Jungle Grid project workflows + password_hash: 8fba13dab71d6fdd8a9b9db1f06e81315dfbfd69167b6097f724604db3c91cdf metadata: permissions: - execute_external_compute - agents: - - jungle-grid-executor mods: - name: openagents.mods.workspace.default enabled: true @@ -40,7 +39,7 @@ network: description: Estimate, approve, execute, and monitor an AI workload on Jungle Grid expose_as_tool: true tool_name: run_jungle_grid_workload - tool_description: Start a Jungle Grid workload project. The task must be a JSON object with name, workload_type, and image; use environment_from_env for workload environment values. + tool_description: Start a Jungle Grid workload project. The task must be a JSON object with name, workload_type, image, and model_size_gb; use environment_from_env for workload environment values. 
tool_mode: async agent_groups: - executors diff --git a/tests/agents/test_jungle_grid_executor.py b/tests/agents/test_jungle_grid_executor.py index 288cc8fce..aa9bf248e 100644 --- a/tests/agents/test_jungle_grid_executor.py +++ b/tests/agents/test_jungle_grid_executor.py @@ -4,12 +4,18 @@ import importlib.util import json from pathlib import Path +from types import SimpleNamespace from unittest.mock import AsyncMock import pytest +import yaml +from openagents.core.network import AgentNetwork from openagents.models.event import Event from openagents.models.event_context import EventContext +from openagents.models.network_config import AgentGroupConfig, NetworkConfig +from openagents.models.transport import TransportType +from openagents.mods.workspace.project.mod import DefaultProjectNetworkMod MODULE_PATH = ( Path(__file__).parent.parent.parent @@ -19,6 +25,7 @@ / "agents" / "jungle_grid_executor.py" ) +NETWORK_CONFIG_PATH = MODULE_PATH.parent.parent / "network.yaml" SPEC = importlib.util.spec_from_file_location("jungle_grid_executor", MODULE_PATH) MODULE = importlib.util.module_from_spec(SPEC) assert SPEC and SPEC.loader @@ -28,6 +35,7 @@ JungleGridError = MODULE.JungleGridError JungleGridExecutorAgent = MODULE.JungleGridExecutorAgent ProjectExecution = MODULE.ProjectExecution +EXECUTORS_GROUP_PASSWORD_HASH = MODULE.EXECUTORS_GROUP_PASSWORD_HASH build_estimate_payload = MODULE.build_estimate_payload build_submit_payload = MODULE.build_submit_payload estimate_can_submit = MODULE.estimate_can_submit @@ -51,6 +59,7 @@ def workload(): "name": "batch-demo", "workload_type": "batch", "image": "python:3.11-slim", + "model_size_gb": 1, "command": "python", "args": ["-c", "print(42)"], "optimize_for": "cost", @@ -63,6 +72,7 @@ def __init__(self): self.estimate_job = AsyncMock(return_value={"available": True, "estimated_cost_usd": {"min": 0.1, "max": 0.2}}) self.submit_job = AsyncMock(return_value={"job_id": "job_123", "status": "queued"}) self.get_job = 
AsyncMock(return_value={"job_id": "job_123", "status": "completed"}) + self.get_job_runtime = AsyncMock(return_value={"exit_code": 0, "stdout_tail": "done"}) self.get_job_logs = AsyncMock(return_value={"items": [{"message": "done"}]}) self.cancel_job = AsyncMock(return_value={"job_id": "job_123", "status": "cancelled", "cancelled": True}) self.list_artifacts = AsyncMock( @@ -86,6 +96,88 @@ def agent_with_mocks(fake=None): return agent +@pytest.mark.asyncio +async def test_executor_group_membership_delivers_project_start_and_returns_estimate(): + network_yaml = yaml.safe_load(NETWORK_CONFIG_PATH.read_text()) + executor_group = network_yaml["network"]["agent_groups"]["executors"] + assert executor_group["password_hash"] == EXECUTORS_GROUP_PASSWORD_HASH + assert "agents" not in executor_group.get("metadata", {}) + + config = NetworkConfig( + name="JungleGridGroupTest", + default_agent_group="guest", + requires_password=False, + agent_groups={"executors": AgentGroupConfig(**executor_group)}, + ) + network = AgentNetwork.create_from_config(config) + registration = await network.register_agent( + agent_id="jungle-grid-executor", + transport_type=TransportType.HTTP, + metadata={"name": "Jungle Grid Executor"}, + certificate=None, + password_hash=EXECUTORS_GROUP_PASSWORD_HASH, + ) + assert registration.success + assert network.topology.agent_group_membership["jungle-grid-executor"] == "executors" + + project_mod = DefaultProjectNetworkMod() + project_mod.update_config( + { + "project_templates": { + "jungle_grid_execution": { + "name": "Jungle Grid GPU Execution", + "agent_groups": ["executors"], + } + } + } + ) + project_mod.initialize() + project_mod.bind_network(network) + assert project_mod._get_agents_in_group("executors") == ["jungle-grid-executor"] + + fake = FakeJungleGridClient() + executor = agent_with_mocks(fake) + delivered = [] + + async def deliver(event): + delivered.append(event) + if event.destination_id == "jungle-grid-executor": + await 
executor.handle_project_started( + EventContext( + incoming_event=event, + event_threads={}, + incoming_thread_id="project-start", + ) + ) + return SimpleNamespace(success=True) + + project_mod.send_event = AsyncMock(side_effect=deliver) + response = await project_mod.process_system_message( + Event( + event_name="project.start", + source_id="human:project-owner", + payload={ + "template_id": "jungle_grid_execution", + "goal": json.dumps(workload()), + "name": "Jungle Grid test", + }, + ) + ) + + assert response.success + assert "jungle-grid-executor" in response.data["authorized_agents"] + assert any( + event.event_name == "project.notification.started" + and event.destination_id == "jungle-grid-executor" + and event.payload["initiator_agent_id"] == "human:project-owner" + for event in delivered + ) + fake.estimate_job.assert_awaited_once_with(build_estimate_payload(workload())) + estimate_message = executor.project_adapter.send_project_message.await_args.kwargs["content"]["text"] + assert "Jungle Grid estimate ready" in estimate_message + assert "APPROVE" in estimate_message + + @pytest.mark.asyncio async def test_successful_estimate_flow_posts_estimate_and_requires_approval(): fake = FakeJungleGridClient() @@ -250,12 +342,16 @@ async def test_logs_and_artifacts_are_stored_in_project_artifact(): await agent._finalize(execution, {"job_id": "job_123", "status": "completed"}) + fake.get_job_runtime.assert_awaited_once_with("job_123") fake.get_job_logs.assert_awaited_once_with("job_123") fake.list_artifacts.assert_awaited_once_with("job_123") fake.get_artifact.assert_awaited_once_with("job_123", "artifact_1") artifact_call = agent.project_adapter.set_project_artifact.await_args assert artifact_call.kwargs["key"] == "jungle_grid_result" assert "output.json" in artifact_call.kwargs["value"] + assert "stdout_tail" in artifact_call.kwargs["value"] + assert "https://example.test/file" not in artifact_call.kwargs["value"] + assert "[REDACTED]" in 
artifact_call.kwargs["value"] @pytest.mark.asyncio @@ -344,6 +440,20 @@ def test_invalid_workload_is_rejected(): parse_workload_goal('{"workload_type": "batch"}') +def test_workload_requires_positive_model_size(): + with pytest.raises(ValueError, match="model_size_gb"): + parse_workload_goal(json.dumps({**workload(), "model_size_gb": 0})) + + +def test_estimate_payload_matches_current_draft_job_fields(): + requested = { + **workload(), + "constraints": {"max_price_per_hour": 2.5, "preferred_gpu_family": "l4"}, + } + + assert build_estimate_payload(requested) == requested + + def test_workload_rejects_literal_credentials_and_secret_like_metadata(): with pytest.raises(ValueError, match="must not contain API keys"): parse_workload_goal(json.dumps({**workload(), "command": "curl -H 'Bearer secret-value'"})) @@ -472,11 +582,41 @@ async def test_network_timeout_is_sanitized(monkeypatch): @pytest.mark.asyncio async def test_api_error_is_sanitized(monkeypatch): monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key") - body = json.dumps({"error": {"code": "FORBIDDEN", "message": "Bearer test-api-key is not allowed"}}) + body = json.dumps( + { + "error": { + "code": "provider_jg_private_backend", + "message": "Bearer test-api-key is not allowed", + } + } + ) monkeypatch.setattr(MODULE.aiohttp, "ClientSession", lambda **kwargs: FakeSession(FakeResponse(403, body))) client = JungleGridClient() with pytest.raises(JungleGridError) as exc_info: await client.get_job("job_123") - assert exc_info.value.code == "FORBIDDEN" + assert "jg_private_backend" not in exc_info.value.code + assert "[REDACTED]" in exc_info.value.code assert "test-api-key" not in str(exc_info.value) + + +def test_client_uses_documented_rest_api_environment(monkeypatch): + monkeypatch.setenv("JUNGLE_GRID_API", "https://orchestrator.example.test/") + monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key") + + client = JungleGridClient() + + assert client.api_base == "https://orchestrator.example.test" + + 
+@pytest.mark.asyncio +async def test_client_uses_documented_runtime_and_log_routes(monkeypatch): + monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key") + client = JungleGridClient() + client._request = AsyncMock(return_value={}) + + await client.get_job_runtime("job_123") + await client.get_job_logs("job_123") + + assert client._request.await_args_list[0].args == ("GET", "/v1/jobs/job_123/runtime") + assert client._request.await_args_list[1].args == ("GET", "/v1/jobs/job_123/logs?tail=100") From 1164763e289f5055d661b6c1c03c347882caae64 Mon Sep 17 00:00:00 2001 From: dejaguarkyng Date: Thu, 11 Jun 2026 15:03:07 +0000 Subject: [PATCH 3/5] feat: align jungle grid demo with current job workflow --- .../agents/jungle_grid_executor.py | 1083 ++++++++++++----- .../09_jungle_grid_gpu_execution/network.yaml | 7 +- 2 files changed, 752 insertions(+), 338 deletions(-) diff --git a/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py b/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py index 23348120b..a12fc9778 100644 --- a/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py +++ b/sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py @@ -1,15 +1,18 @@ #!/usr/bin/env python3 -"""Jungle Grid execution agent for the OpenAgents project workflow demo.""" +"""Human-approved Jungle Grid execution through an OpenAgents project.""" + +from __future__ import annotations import asyncio +import copy import json import logging import os import re import uuid -from dataclasses import dataclass -from typing import Any, Dict, Iterable, Optional -from urllib.parse import quote +from dataclasses import asdict, dataclass, field +from typing import Any, Awaitable, Callable, Iterable, Mapping, Optional +from urllib.parse import quote, urlencode import aiohttp @@ -21,35 +24,71 @@ DEFAULT_API_BASE = "https://api.junglegrid.dev" EXECUTORS_GROUP_PASSWORD_HASH = 
"8fba13dab71d6fdd8a9b9db1f06e81315dfbfd69167b6097f724604db3c91cdf" -TERMINAL_STATUSES = {"completed", "failed", "rejected", "cancelled"} -VALID_WORKLOAD_TYPES = {"inference", "training", "batch"} +STATE_ARTIFACT = "jungle_grid_execution_state" +TERMINAL_STATUSES = {"completed", "failed", "rejected", "cancelled", "canceled"} +VALID_WORKLOAD_TYPES = {"inference", "training", "fine_tuning", "batch"} VALID_OPTIMIZE_FOR = {"balanced", "cost", "speed"} +VALID_GPU_CLASSES = {"consumer", "datacenter"} +VALID_REGION_MODES = {"prefer", "strict"} +VALID_PRIORITIES = {"low", "balanced", "high", "low_latency", "low_cost", "high_reliability"} +VALID_PRECISIONS = {"fp32", "fp16", "bf16", "int8"} +CONSTRAINT_FIELDS = { + "max_price_per_hour", + "gpu_type", + "gpu_class", + "preferred_gpu_family", + "avoid_gpu_families", + "region_preference", + "region_mode", + "latency_priority", + "cost_priority", +} +MAX_SHARED_LOGS = 200 +MAX_SHARED_EVENTS = 200 + SUBMIT_FIELDS = { "name", "workload_type", "image", - "model_size_gb", "command", "args", "environment_from_env", - "optimize_for", - "constraints", + "input_files", + "script_files", + "expected_artifacts", "template", "metadata", -} -ESTIMATE_FIELDS = { - "name", - "workload_type", - "image", + "callback", "model_size_gb", - "command", - "args", + "batch_size", + "precision", + "disk_gb", + "gpu_required", + "gpu_count", + "gpu_type", + "gpu_class", + "min_vram_gb", + "max_price_per_hour", + "preferred_gpu_family", + "avoid_gpu_families", + "region_preference", + "region_mode", + "priority", + "latency_priority", + "cost_priority", + "timeout_seconds", + "routing_mode", "optimize_for", "constraints", - "template", } -SENSITIVE_PATTERN = re.compile(r"(?i)(bearer\s+)[^\s,;]+|jg_[A-Za-z0-9_-]+") -SENSITIVE_KEY_PATTERN = re.compile(r"(?i)(api[_-]?key|authorization|password|secret|token)") +ESTIMATE_FIELDS = SUBMIT_FIELDS - {"environment_from_env"} +SECRET_KEY_PATTERN = re.compile( + 
r"(?i)(api[_-]?key|authorization|password|secret|token|auth_token|upload_url|download_url|complete_url)" +) +SECRET_TEXT_PATTERN = re.compile( + r"(?i)(bearer\s+)[^\s,;]+|(? str: - """Return a log-safe string with credentials removed.""" +def redact_sensitive(value: object, secrets: Iterable[str] = ()) -> str: + """Return a project-safe string.""" text = str(value) - if secret: - text = text.replace(secret, "[REDACTED]") - return SENSITIVE_PATTERN.sub(lambda match: f"{match.group(1) or ''}[REDACTED]", text) + for secret in secrets: + if secret: + text = text.replace(secret, "[REDACTED]") + return SECRET_TEXT_PATTERN.sub(lambda match: f"{match.group(1) or ''}[REDACTED]", text) -def _collect_string_values(value: Any) -> list[str]: - """Collect nested string values that must not be exposed in project output.""" - if isinstance(value, str): - return [value] if value else [] - if isinstance(value, dict): - strings = [] - for nested in value.values(): - strings.extend(_collect_string_values(nested)) - return strings - if isinstance(value, list): - strings = [] - for nested in value: - strings.extend(_collect_string_values(nested)) - return strings - return [] - - -def _contains_sensitive_key(value: Any) -> bool: - """Return whether nested data uses a key commonly associated with credentials.""" - if isinstance(value, dict): +def contains_sensitive_key(value: object) -> bool: + if isinstance(value, Mapping): return any( - SENSITIVE_KEY_PATTERN.search(str(key)) or _contains_sensitive_key(nested) for key, nested in value.items() + SECRET_KEY_PATTERN.search(str(key)) or contains_sensitive_key(nested) for key, nested in value.items() ) if isinstance(value, list): - return any(_contains_sensitive_key(nested) for nested in value) + return any(contains_sensitive_key(nested) for nested in value) return False -def sanitize_project_data(value: Any, secrets: Iterable[str]) -> Any: - """Recursively redact credentials and workload-provided secret values.""" +def 
sanitize_project_data(value: object, secrets: Iterable[str] = ()) -> object: + """Recursively redact credentials, signed URLs, and resolved environment values.""" secret_values = [secret for secret in secrets if secret] if isinstance(value, str): - result = value - for secret in secret_values: - result = result.replace(secret, "[REDACTED]") - return redact_sensitive(result) - if isinstance(value, dict): - return {key: sanitize_project_data(nested, secret_values) for key, nested in value.items()} + return redact_sensitive(value, secret_values) + if isinstance(value, Mapping): + result: dict[str, object] = {} + for key, nested in value.items(): + clean_key = str(key) + if SECRET_KEY_PATTERN.search(clean_key): + result[clean_key] = "[REDACTED]" + else: + result[clean_key] = sanitize_project_data(nested, secret_values) + return result if isinstance(value, list): return [sanitize_project_data(nested, secret_values) for nested in value] return value -def _unwrap_response(data: Any) -> Any: - if isinstance(data, dict) and data.get("ok") is True and "data" in data: +def unwrap_response(data: object) -> object: + if isinstance(data, Mapping) and data.get("ok") is True and "data" in data: return data["data"] return data -def _error_detail(data: Any, status: int) -> tuple[str, str]: - if isinstance(data, dict): +def error_detail(data: object, status: int) -> tuple[str, str]: + if isinstance(data, Mapping): nested = data.get("error") - if isinstance(nested, dict): - return ( - redact_sensitive(nested.get("code") or "API_ERROR"), - redact_sensitive(nested.get("message") or f"HTTP {status}"), - ) + source = nested if isinstance(nested, Mapping) else data return ( - redact_sensitive(data.get("code") or "API_ERROR"), - redact_sensitive(data.get("message") or f"HTTP {status}"), + redact_sensitive(source.get("code") or "API_ERROR"), + redact_sensitive(source.get("message") or f"HTTP {status}"), ) return "API_ERROR", f"HTTP {status}" class JungleGridClient: - """Small async client 
for Jungle Grid's documented public execution API.""" + """Async client matching the current Jungle Grid MCP-backed REST contract.""" def __init__( self, api_base: Optional[str] = None, timeout_seconds: float = 30.0, + read_retries: int = 2, + retry_delay_seconds: float = 0.5, + sleep: Callable[[float], Awaitable[None]] = asyncio.sleep, ): - raw_api_base = api_base if api_base is not None else os.getenv("JUNGLE_GRID_API", DEFAULT_API_BASE) + configured_base = ( + api_base + or os.getenv("JUNGLEGRID_API_BASE") + or os.getenv("JUNGLE_GRID_API_URL") + or os.getenv("JUNGLE_GRID_API") + or DEFAULT_API_BASE + ) self.api_key = os.getenv("JUNGLE_GRID_API_KEY", "").strip() - self.api_base = raw_api_base.rstrip("/") + self.api_base = configured_base.strip().rstrip("/") self.timeout_seconds = timeout_seconds + self.read_retries = max(0, read_retries) + self.retry_delay_seconds = max(0.0, retry_delay_seconds) + self.sleep = sleep def _require_api_key(self) -> str: if not self.api_key: raise JungleGridError("MISSING_API_KEY", "JUNGLE_GRID_API_KEY is required.") return self.api_key - async def _request(self, method: str, path: str, payload: Optional[Dict[str, Any]] = None) -> Dict[str, Any]: + async def _request( + self, + method: str, + path: str, + payload: Optional[dict[str, object]] = None, + ) -> dict[str, object]: api_key = self._require_api_key() - timeout = aiohttp.ClientTimeout(total=self.timeout_seconds) - headers = { - "Accept": "application/json", - "Authorization": f"Bearer {api_key}", - "Content-Type": "application/json", - } - try: - async with aiohttp.ClientSession(timeout=timeout) as session: - async with session.request(method, f"{self.api_base}{path}", headers=headers, json=payload) as response: - text = await response.text() - try: - data = json.loads(text) if text.strip() else {} - except json.JSONDecodeError as exc: - raise JungleGridError( - "INVALID_API_RESPONSE", "Jungle Grid returned invalid JSON.", response.status - ) from exc - if response.status < 
200 or response.status >= 300: - code, message = _error_detail(data, response.status) - raise JungleGridError( - redact_sensitive(code, api_key), - redact_sensitive(message, api_key), - response.status, - ) - except asyncio.TimeoutError as exc: - raise JungleGridError("NETWORK_TIMEOUT", "Jungle Grid request timed out.") from exc - except aiohttp.ClientError as exc: - raise JungleGridError("NETWORK_ERROR", redact_sensitive(exc, api_key)) from exc - - result = _unwrap_response(data) - if not isinstance(result, dict): - raise JungleGridError("INVALID_API_RESPONSE", "Jungle Grid returned an unexpected response shape.") - return result + attempts = self.read_retries + 1 if method == "GET" else 1 + for attempt in range(attempts): + try: + timeout = aiohttp.ClientTimeout(total=self.timeout_seconds) + headers = { + "Accept": "application/json", + "Authorization": f"Bearer {api_key}", + "Content-Type": "application/json", + } + async with aiohttp.ClientSession(timeout=timeout) as session: + async with session.request( + method, f"{self.api_base}{path}", headers=headers, json=payload + ) as response: + text = await response.text() + try: + data = json.loads(text) if text.strip() else {} + except json.JSONDecodeError as exc: + raise JungleGridError( + "INVALID_API_RESPONSE", + "Jungle Grid returned invalid JSON.", + response.status, + ) from exc + if not 200 <= response.status < 300: + code, message = error_detail(data, response.status) + raise JungleGridError(code, message, response.status) + result = unwrap_response(data) + if not isinstance(result, dict): + raise JungleGridError( + "INVALID_API_RESPONSE", + "Jungle Grid returned an unexpected response shape.", + ) + return result + except (asyncio.TimeoutError, aiohttp.ClientError) as exc: + if attempt + 1 < attempts: + await self.sleep(self.retry_delay_seconds * (2**attempt)) + continue + code = "NETWORK_TIMEOUT" if isinstance(exc, asyncio.TimeoutError) else "NETWORK_ERROR" + message = ( + "Jungle Grid request timed out." 
+ if code == "NETWORK_TIMEOUT" + else "Jungle Grid network request failed." + ) + raise JungleGridError(code, message) from exc + except JungleGridError as exc: + retryable = method == "GET" and (exc.status is None or exc.status == 429 or exc.status >= 500) + if retryable and attempt + 1 < attempts: + await self.sleep(self.retry_delay_seconds * (2**attempt)) + continue + raise JungleGridError( + redact_sensitive(exc.code, [api_key]), + redact_sensitive(exc, [api_key]), + exc.status, + ) from exc + raise JungleGridError("NETWORK_ERROR", "Jungle Grid request failed.") - async def estimate_job(self, workload: Dict[str, Any]) -> Dict[str, Any]: - return await self._request("POST", "/v1/jobs/estimate", workload) + async def estimate_job(self, workload: dict[str, object]) -> dict[str, object]: + return await self._request("POST", "/v1/mcp/jobs/estimate", workload) - async def submit_job(self, workload: Dict[str, Any]) -> Dict[str, Any]: - return await self._request("POST", "/v1/jobs", workload) + async def submit_job(self, workload: dict[str, object]) -> dict[str, object]: + return await self._request("POST", "/v1/mcp/jobs", workload) - async def get_job(self, job_id: str) -> Dict[str, Any]: - return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}") + async def get_job(self, job_id: str) -> dict[str, object]: + return await self._request("GET", f"/v1/mcp/jobs/{quote(job_id, safe='')}") - async def get_job_runtime(self, job_id: str) -> Dict[str, Any]: - return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/runtime") + async def get_job_events(self, job_id: str) -> dict[str, object]: + return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/events") - async def get_job_logs(self, job_id: str) -> Dict[str, Any]: - return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/logs?tail=100") + async def get_job_logs( + self, + job_id: str, + *, + limit: int = 100, + cursor: Optional[str] = None, + tail: Optional[int] = 
None, + ) -> dict[str, object]: + params: dict[str, object] = {"limit": limit} + if cursor: + params["cursor"] = cursor + if tail is not None: + params["tail"] = tail + return await self._request("GET", f"/v1/mcp/jobs/{quote(job_id, safe='')}/logs?{urlencode(params)}") + + async def get_job_runtime(self, job_id: str) -> dict[str, object]: + return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/runtime") - async def cancel_job(self, job_id: str, reason: str) -> Dict[str, Any]: - return await self._request("POST", f"/v1/jobs/{quote(job_id, safe='')}/cancel", {"reason": reason}) + async def cancel_job(self, job_id: str, reason: str) -> dict[str, object]: + return await self._request( + "POST", + f"/v1/mcp/jobs/{quote(job_id, safe='')}/cancel", + {"reason": reason}, + ) - async def list_artifacts(self, job_id: str) -> Dict[str, Any]: - return await self._request("GET", f"/v1/jobs/{quote(job_id, safe='')}/artifacts") + async def list_artifacts(self, job_id: str) -> dict[str, object]: + return await self._request("GET", f"/v1/mcp/jobs/{quote(job_id, safe='')}/artifacts") - async def get_artifact(self, job_id: str, artifact_id: str) -> Dict[str, Any]: + async def get_artifact(self, job_id: str, artifact_id: str) -> dict[str, object]: return await self._request( "POST", - f"/v1/jobs/{quote(job_id, safe='')}/artifacts/{quote(artifact_id, safe='')}/download", + f"/v1/mcp/jobs/{quote(job_id, safe='')}/artifacts/{quote(artifact_id, safe='')}/download", ) -def parse_workload_goal(goal: str) -> Dict[str, Any]: - """Parse and validate a project goal containing a Jungle Grid workload JSON object.""" +def _string(value: object, field_name: str) -> str: + if not isinstance(value, str) or not value.strip(): + raise ValueError(f"{field_name} must be a non-empty string.") + return value.strip() + + +def _string_array(value: object, field_name: str) -> list[str]: + if not isinstance(value, list) or not all(isinstance(item, str) and item for item in value): + raise 
ValueError(f"{field_name} must be an array of non-empty strings.") + return value + + +def _positive_number(value: object, field_name: str, *, allow_zero: bool = False) -> None: + if isinstance(value, bool) or not isinstance(value, (int, float)): + raise ValueError(f"{field_name} must be a number.") + if value < 0 if allow_zero else value <= 0: + qualifier = "zero or greater" if allow_zero else "positive" + raise ValueError(f"{field_name} must be {qualifier}.") + + +def _validate_input_references(value: object, field_name: str) -> list[dict[str, str]]: + if not isinstance(value, list): + raise ValueError(f"{field_name} must be an array of input_id references.") + result: list[dict[str, str]] = [] + for item in value: + if isinstance(item, str): + input_id = item.strip() + elif isinstance(item, Mapping) and set(item) == {"input_id"}: + input_id = _string(item.get("input_id"), f"{field_name}.input_id") + else: + raise ValueError(f"{field_name} items must contain only input_id.") + if not INPUT_ID_PATTERN.fullmatch(input_id): + raise ValueError(f"{field_name} contains an invalid input_id.") + result.append({"input_id": input_id}) + return result + + +def _validate_callback(value: object) -> dict[str, object]: + if not isinstance(value, Mapping): + raise ValueError("callback must be an object.") + unsupported = set(value) - {"url", "metadata", "auth_token_from_env"} + if unsupported: + raise ValueError(f"Unsupported callback fields: {', '.join(sorted(unsupported))}.") + result: dict[str, object] = {"url": _string(value.get("url"), "callback.url")} + metadata = value.get("metadata") + if metadata is not None: + if not isinstance(metadata, Mapping) or not all( + isinstance(key, str) and isinstance(item, str) for key, item in metadata.items() + ): + raise ValueError("callback.metadata must map strings to strings.") + if contains_sensitive_key(metadata): + raise ValueError("callback.metadata must not contain secret-like keys.") + result["metadata"] = dict(metadata) + 
auth_env = value.get("auth_token_from_env") + if auth_env is not None: + result["auth_token_from_env"] = _string(auth_env, "callback.auth_token_from_env") + return result + + +def _validate_constraints(value: object) -> dict[str, object]: + if not isinstance(value, Mapping): + raise ValueError("constraints must be an object.") + unsupported = sorted(set(value) - CONSTRAINT_FIELDS) + if unsupported: + raise ValueError(f"Unsupported constraint fields: {', '.join(unsupported)}.") + result = dict(value) + if "max_price_per_hour" in result: + _positive_number(result["max_price_per_hour"], "constraints.max_price_per_hour") + if "gpu_class" in result and result["gpu_class"] not in VALID_GPU_CLASSES: + raise ValueError(f"constraints.gpu_class must be one of: {', '.join(sorted(VALID_GPU_CLASSES))}.") + if "region_mode" in result and result["region_mode"] not in VALID_REGION_MODES: + raise ValueError(f"constraints.region_mode must be one of: {', '.join(sorted(VALID_REGION_MODES))}.") + for field_name in ("latency_priority", "cost_priority"): + if field_name in result and result[field_name] not in {"low", "balanced", "high"}: + raise ValueError(f"constraints.{field_name} must be one of: balanced, high, low.") + if "avoid_gpu_families" in result: + result["avoid_gpu_families"] = _string_array(result["avoid_gpu_families"], "constraints.avoid_gpu_families") + for field_name in ("gpu_type", "preferred_gpu_family", "region_preference"): + if field_name in result: + result[field_name] = _string(result[field_name], f"constraints.{field_name}") + return result + + +def parse_workload_goal(goal: str) -> dict[str, object]: + """Parse and validate a project goal without resolving any secrets.""" text = goal.strip() if text.startswith("```"): text = re.sub(r"^```(?:json)?\s*", "", text) text = re.sub(r"\s*```$", "", text) try: - workload = json.loads(text) + raw = json.loads(text) except json.JSONDecodeError as exc: raise ValueError("Project goal must be a JSON object describing the 
Jungle Grid workload.") from exc
-    if not isinstance(workload, dict):
+    if not isinstance(raw, dict):
         raise ValueError("Project goal must be a JSON object.")
-    if SENSITIVE_PATTERN.search(json.dumps(workload)):
-        raise ValueError("Workload must not contain API keys or Bearer tokens.")
-
-    unsupported = sorted(set(workload) - SUBMIT_FIELDS)
+    if SECRET_TEXT_PATTERN.search(json.dumps(raw)):
+        raise ValueError("Workload must not contain API keys, Bearer tokens, or signed URLs.")
+    unsupported = sorted(set(raw) - SUBMIT_FIELDS)
     if unsupported:
         raise ValueError(f"Unsupported workload fields: {', '.join(unsupported)}.")
-    required = {"name", "workload_type", "image"}
-    missing = sorted(key for key in required if not isinstance(workload.get(key), str) or not workload[key].strip())
-    if missing:
-        raise ValueError(f"Missing required workload fields: {', '.join(missing)}.")
-    model_size_gb = workload.get("model_size_gb")
-    if not isinstance(model_size_gb, (int, float)) or isinstance(model_size_gb, bool) or model_size_gb <= 0:
-        raise ValueError("model_size_gb must be a positive number.")
+
+    workload = dict(raw)
+    for required in ("name", "workload_type", "image"):
+        workload[required] = _string(workload.get(required), required)
     if workload["workload_type"] not in VALID_WORKLOAD_TYPES:
         raise ValueError(f"workload_type must be one of: {', '.join(sorted(VALID_WORKLOAD_TYPES))}.")
-    if "optimize_for" in workload and workload["optimize_for"] not in VALID_OPTIMIZE_FOR:
-        raise ValueError(f"optimize_for must be one of: {', '.join(sorted(VALID_OPTIMIZE_FOR))}.")
-    if "args" in workload and not (
-        isinstance(workload["args"], list) and all(isinstance(item, str) for item in workload["args"])
-    ):
-        raise ValueError("args must be an array of strings.")
-    if "environment_from_env" in workload and not (
-        isinstance(workload["environment_from_env"], dict)
-        and all(
+
+    command = workload.get("command")
+    args = workload.get("args")
+    if isinstance(command, str):
+        workload["command"] = _string(command, "command")
+        if args is not None:
+            workload["args"] = _string_array(args, "args")
+    elif isinstance(command, list):
+        workload["command"] = _string_array(command, "command")
+        if args is not None:
+            raise ValueError("args cannot be combined with the command-array format.")
+    elif command is not None:
+        raise ValueError("command must be a string or an array of strings.")
+    elif args is not None:
+        raise ValueError("args requires command.")
+
+    for field_name in ("input_files", "script_files"):
+        if field_name in workload:
+            workload[field_name] = _validate_input_references(workload[field_name], field_name)
+    if "expected_artifacts" in workload:
+        paths = _string_array(workload["expected_artifacts"], "expected_artifacts")
+        if not all(path.startswith("/workspace/artifacts/") for path in paths):
+            raise ValueError("expected_artifacts must be paths under /workspace/artifacts/.")
+        workload["expected_artifacts"] = paths
+    if any(key in workload for key in ("local_path", "path", "file_path")):
+        raise ValueError("Arbitrary local file access is not supported.")
+
+    env_refs = workload.get("environment_from_env")
+    if env_refs is not None and (
+        not isinstance(env_refs, Mapping)
+        or not all(
             isinstance(key, str) and key.strip() and isinstance(value, str) and value.strip()
-            for key, value in workload["environment_from_env"].items()
+            for key, value in env_refs.items()
         )
     ):
-        raise ValueError("environment_from_env must map workload variable names to local environment variable names.")
-    if _contains_sensitive_key(workload.get("metadata")):
+        raise ValueError("environment_from_env must map workload names to local environment names.")
+    if contains_sensitive_key(workload.get("metadata")):
         raise ValueError("metadata must not contain secret-like keys.")
+    if "callback" in workload:
+        workload["callback"] = _validate_callback(workload["callback"])
+    if "gpu_required" in workload and not isinstance(workload["gpu_required"], bool):
+        raise ValueError("gpu_required must be a boolean.")
+
+    for field_name in ("model_size_gb", "batch_size", "disk_gb", "gpu_count", "min_vram_gb", "max_price_per_hour"):
+        if field_name in workload:
+            _positive_number(
+                workload[field_name], field_name, allow_zero=field_name in {"batch_size", "disk_gb", "gpu_count"}
+            )
+    if "timeout_seconds" in workload:
+        _positive_number(workload["timeout_seconds"], "timeout_seconds")
+    for field_name, allowed in (
+        ("gpu_class", VALID_GPU_CLASSES),
+        ("region_mode", VALID_REGION_MODES),
+        ("precision", VALID_PRECISIONS),
+        ("priority", VALID_PRIORITIES),
+        ("latency_priority", {"low", "balanced", "high"}),
+        ("cost_priority", {"low", "balanced", "high"}),
+    ):
+        if field_name in workload and workload[field_name] not in allowed:
+            raise ValueError(f"{field_name} must be one of: {', '.join(sorted(allowed))}.")
+    optimize = workload.get("routing_mode", workload.get("optimize_for"))
+    if "routing_mode" in workload and "optimize_for" in workload:
+        raise ValueError("Use routing_mode or optimize_for, not both.")
+    if optimize is not None and optimize not in VALID_OPTIMIZE_FOR:
+        raise ValueError(f"routing preference must be one of: {', '.join(sorted(VALID_OPTIMIZE_FOR))}.")
+    if "avoid_gpu_families" in workload:
+        workload["avoid_gpu_families"] = _string_array(workload["avoid_gpu_families"], "avoid_gpu_families")
+    if "constraints" in workload:
+        workload["constraints"] = _validate_constraints(workload["constraints"])
     return workload
-def build_estimate_payload(workload: Dict[str, Any]) -> Dict[str, Any]:
-    """Build an estimate-only payload without submit-only or secret-bearing fields."""
-    return {key: value for key, value in workload.items() if key in ESTIMATE_FIELDS}
+def _api_workload_type(value: object) -> object:
+    return "fine-tuning" if value == "fine_tuning" else value
-def build_submit_payload(workload: Dict[str, Any]) -> Dict[str, Any]:
-    """Build a submit payload, resolving secret environment values only at submission time."""
-    payload = {key: value for key, value in workload.items() if key != "environment_from_env"}
-    references = workload.get("environment_from_env")
-    if not references:
-        return payload
+def normalize_api_payload(workload: Mapping[str, object]) -> dict[str, object]:
+    """Convert goal compatibility aliases to the current Jungle Grid shape."""
+    payload = copy.deepcopy(dict(workload))
+    payload["workload_type"] = _api_workload_type(payload["workload_type"])
+    if "routing_mode" in payload:
+        payload["optimize_for"] = payload.pop("routing_mode")
+    if isinstance(payload.get("command"), str):
+        legacy_args = payload.pop("args", [])
+        payload["command"] = [
+            payload["command"],
+            *(legacy_args if isinstance(legacy_args, list) else []),
+        ]
+    return payload
-    missing = sorted(env_name for env_name in references.values() if not os.getenv(env_name))
-    if missing:
-        raise ValueError(f"Missing required local environment variables: {', '.join(missing)}.")
-    payload["environment"] = {name: os.environ[env_name] for name, env_name in references.items()}
+
+
+def build_estimate_payload(workload: Mapping[str, object]) -> dict[str, object]:
+    payload = normalize_api_payload({key: value for key, value in workload.items() if key in ESTIMATE_FIELDS})
+    callback = payload.get("callback")
+    if isinstance(callback, dict):
+        callback.pop("auth_token_from_env", None)
     return payload
-def public_workload(workload: Dict[str, Any]) -> Dict[str, Any]:
-    """Return workload metadata safe to share in a project message or artifact."""
+def build_submit_payload(workload: Mapping[str, object]) -> tuple[dict[str, object], list[str]]:
+    """Resolve environment-backed secrets only after human approval."""
+    payload = normalize_api_payload({key: value for key, value in workload.items() if key != "environment_from_env"})
+    secrets: list[str] = []
+    references = workload.get("environment_from_env")
+    if isinstance(references, Mapping):
+        missing = sorted(str(env_name) for env_name in references.values() if not os.getenv(str(env_name)))
+        if missing:
+            raise ValueError(f"Missing required local environment variables: {', '.join(missing)}.")
+        environment = {str(name): os.environ[str(env_name)] for name, env_name in references.items()}
+        payload["environment"] = environment
+        secrets.extend(environment.values())
+    callback = payload.get("callback")
+    if isinstance(callback, dict):
+        auth_env = callback.pop("auth_token_from_env", None)
+        if auth_env:
+            token = os.getenv(str(auth_env))
+            if not token:
+                raise ValueError(f"Missing required local environment variable: {auth_env}.")
+            callback["auth_token"] = token
+            secrets.append(token)
+    return payload, secrets
+
+
+def public_workload(workload: Mapping[str, object]) -> dict[str, object]:
     result = dict(workload)
-    if "metadata" in result:
-        metadata = result["metadata"]
-        result["metadata"] = {key: "[REDACTED]" for key in metadata} if isinstance(metadata, dict) else "[REDACTED]"
+    metadata = result.get("metadata")
+    if isinstance(metadata, Mapping):
+        result["metadata"] = {str(key): "[REDACTED]" for key in metadata}
     return result
-def lifecycle_label(status: str) -> str:
-    """Map Jungle Grid status to a user-facing lifecycle label."""
-    if status == "assigned":
-        return "assigned (provisioning)"
-    return status
+def estimate_can_submit(estimate: Mapping[str, object]) -> bool:
+    screening = estimate.get("screening")
+    if isinstance(screening, Mapping) and screening.get("can_submit") is False:
+        return False
+    return estimate.get("available") is not False and estimate.get("can_submit") is not False
-def estimate_can_submit(estimate: Dict[str, Any]) -> bool:
-    """Return whether an estimate explicitly permits submission."""
-    return estimate.get("available") is not False and estimate.get("can_submit") is not False
+def estimate_summary(estimate: Mapping[str, object]) -> str:
+    """Build a compact summary without claiming immediate capacity."""
+    parts: list[str] = []
+    cost = estimate.get("estimated_cost_usd")
+    if cost is None:
+        minimum = estimate.get("estimated_cost_min_usd")
+        maximum = estimate.get("estimated_cost_max_usd")
+        if minimum is not None or maximum is not None:
+            cost = {"min": minimum, "max": maximum}
+    if cost is not None:
+        parts.append(f"estimated cost `{json.dumps(cost, sort_keys=True)}` USD")
+    duration_min = estimate.get("estimated_runtime_min_minutes")
+    duration_max = estimate.get("estimated_runtime_max_minutes")
+    if duration_min is not None or duration_max is not None:
+        parts.append(f"duration `{duration_min or '?'}-{duration_max or '?'}` minutes")
+    capacity = estimate.get("capacity_status")
+    if isinstance(capacity, Mapping):
+        if capacity.get("availability"):
+            parts.append(f"capacity `{capacity['availability']}`")
+        if capacity.get("immediate_capacity_confirmed") is False:
+            parts.append("immediate worker pickup not confirmed")
+    warnings = estimate.get("warnings")
+    if isinstance(warnings, list) and warnings:
+        parts.append(f"{len(warnings)} warning(s)")
+    return "; ".join(parts) if parts else "structured estimate stored in `jungle_grid_estimate`"
+
+
+def status_fingerprint(job: Mapping[str, object]) -> str:
+    fields = (
+        "status",
+        "execution_phase",
+        "status_message",
+        "phase_started_at",
+        "delayed_start",
+        "delay_reason",
+        "failure",
+    )
+    return json.dumps({key: job.get(key) for key in fields}, sort_keys=True, default=str)
 @dataclass
 class ProjectExecution:
-    """State tracked between estimate, approval, submission, and completion."""
-
     project_id: str
-    workload: Dict[str, Any]
+    workload: dict[str, object]
     estimate_id: str
-    estimate: Dict[str, Any]
+    estimate: dict[str, object]
     job_id: Optional[str] = None
-    last_status: Optional[str] = None
     approved_by: Optional[str] = None
-    submission_started: bool = False
-    submit_payload: Optional[Dict[str, Any]] = None
-    secret_values: Optional[list[str]] = None
+    submission_state: str = "pending"
+    cancel_requested: bool = False
+    terminal: bool = False
+    last_status_fingerprint: Optional[str] = None
+    log_cursor: Optional[str] = None
+    seen_event_ids: list[str] = field(default_factory=list)
+    logs: list[object] = field(default_factory=list)
+    events: list[object] = field(default_factory=list)
+    secret_values: list[str] = field(default_factory=list, repr=False)
+
+    def persisted(self) -> dict[str, object]:
+        data = asdict(self)
+        data.pop("secret_values", None)
+        return data
+
+    @classmethod
+    def from_persisted(cls, value: Mapping[str, object]) -> ProjectExecution:
+        allowed = cls.__dataclass_fields__.keys()
+        return cls(**{key: value[key] for key in allowed if key in value})  # type: ignore[arg-type]
 class JungleGridExecutorAgent(WorkerAgent):
-    """Execute approved Jungle Grid workloads and report results to an OpenAgents project."""
+    """Deterministic executor for the Jungle Grid project demo."""
     default_agent_id = "jungle-grid-executor"
@@ -325,208 +616,297 @@ def __init__(
         self,
         jungle_grid_client: Optional[JungleGridClient] = None,
         poll_interval_seconds: float = 10.0,
+        max_poll_failures: int = 3,
+        sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
         **kwargs: Any,
     ):
         super().__init__(**kwargs)
         self.jungle_grid = jungle_grid_client or JungleGridClient()
-        self.poll_interval_seconds = poll_interval_seconds
+        self.poll_interval_seconds = max(0.0, poll_interval_seconds)
+        self.max_poll_failures = max(1, max_poll_failures)
+        self.sleep = sleep
         self.project_adapter = DefaultProjectAgentAdapter()
-        self.executions: Dict[str, ProjectExecution] = {}
-        self.monitor_tasks: Dict[str, asyncio.Task] = {}
+        self.executions: dict[str, ProjectExecution] = {}
+        self.monitor_tasks: dict[str, asyncio.Task[None]] = {}
+        self.project_locks: dict[str, asyncio.Lock] = {}
-    async def on_startup(self):
-        """Bind the project adapter after the OpenAgents client is connected."""
+    async def on_startup(self) -> None:
         self.project_adapter.bind_client(self.client)
+        if self.client.connector is None:
+            raise RuntimeError("OpenAgents connector is unavailable during startup.")
         self.project_adapter.bind_connector(self.client.connector)
         self.project_adapter.bind_agent(self.agent_id)
         logger.info("Jungle Grid executor is ready")
-    async def on_shutdown(self):
-        """Stop local monitor tasks without cancelling remote jobs."""
+    async def on_shutdown(self) -> None:
         for task in self.monitor_tasks.values():
             task.cancel()
         if self.monitor_tasks:
             await asyncio.gather(*self.monitor_tasks.values(), return_exceptions=True)
-    async def _post(self, project_id: str, text: str):
+    async def _post(self, project_id: str, text: str) -> None:
         await self.project_adapter.send_project_message(project_id=project_id, content={"text": text})
-    async def _set_artifact(self, project_id: str, key: str, value: Dict[str, Any]):
+    async def _set_artifact(self, project_id: str, key: str, value: object) -> None:
+        safe = sanitize_project_data(value, [self.jungle_grid.api_key])
         await self.project_adapter.set_project_artifact(
-            project_id=project_id, key=key, value=json.dumps(value, indent=2)
+            project_id=project_id, key=key, value=json.dumps(safe, indent=2, sort_keys=True)
         )
-    def _project_secrets(self, execution: ProjectExecution) -> list[str]:
-        return [
-            self.jungle_grid.api_key,
-            *(execution.secret_values or []),
-            *_collect_string_values(execution.workload.get("metadata")),
-        ]
+    async def _save_state(self, execution: ProjectExecution) -> None:
+        await self._set_artifact(execution.project_id, STATE_ARTIFACT, execution.persisted())
+
+    async def _load_state(self, project_id: str) -> Optional[ProjectExecution]:
+        if project_id in self.executions:
+            return self.executions[project_id]
+        response = await self.project_adapter.get_project_artifact(project_id=project_id, key=STATE_ARTIFACT)
+        if not response.get("success"):
+            return None
+        value = response.get("data", {}).get("value")
+        if not isinstance(value, str) or not value.strip():
+            return None
+        try:
+            raw = json.loads(value)
+            if not isinstance(raw, dict):
+                return None
+            execution = ProjectExecution.from_persisted(raw)
+        except (TypeError, ValueError, json.JSONDecodeError):
+            return None
+        self.executions[project_id] = execution
+        return execution
-    def _sanitize_for_project(self, value: Any, execution: ProjectExecution) -> Any:
-        return sanitize_project_data(value, self._project_secrets(execution))
+    def _secrets(self, execution: ProjectExecution) -> list[str]:
+        return [self.jungle_grid.api_key, *execution.secret_values]
-    def _is_human_approver(self, sender_id: str) -> bool:
-        return sender_id.startswith("human:")
+    def _safe(self, value: object, execution: ProjectExecution) -> object:
+        return sanitize_project_data(value, self._secrets(execution))
+
+    @staticmethod
+    def _is_human(sender_id: str) -> bool:
+        return sender_id.startswith("human:") and len(sender_id) > len("human:")
     @on_event("project.notification.started")
-    async def handle_project_started(self, context: EventContext):
-        """Estimate a workload and request human approval without submitting it."""
+    async def handle_project_started(self, context: EventContext) -> None:
         payload = context.incoming_event.payload
         project_id = payload.get("project_id")
-        goal = payload.get("goal", "")
-        if not project_id:
+        if not isinstance(project_id, str) or not project_id:
             return
-        try:
-            workload = parse_workload_goal(goal)
-            estimate = await self.jungle_grid.estimate_job(build_estimate_payload(workload))
-            estimate_id = uuid.uuid4().hex[:12]
-            execution = ProjectExecution(project_id, workload, estimate_id, estimate)
-            self.executions[project_id] = execution
-            shared_workload = self._sanitize_for_project(public_workload(workload), execution)
-            shared_estimate = self._sanitize_for_project(estimate, execution)
-            await self._set_artifact(
-                project_id,
-                "jungle_grid_estimate",
-                {"estimate_id": estimate_id, "workload": shared_workload, "estimate": shared_estimate},
-            )
-            if not estimate_can_submit(estimate):
+        lock = self.project_locks.setdefault(project_id, asyncio.Lock())
+        async with lock:
+            existing = await self._load_state(project_id)
+            if existing:
+                if existing.job_id and not existing.terminal:
+                    self._ensure_monitor(existing)
+                return
+            try:
+                workload = parse_workload_goal(str(payload.get("goal", "")))
+                estimate = await self.jungle_grid.estimate_job(build_estimate_payload(workload))
+                execution = ProjectExecution(
+                    project_id=project_id,
+                    workload=workload,
+                    estimate_id=uuid.uuid4().hex[:12],
+                    estimate=estimate,
+                )
+                self.executions[project_id] = execution
+                await self._save_state(execution)
+                shared = {
+                    "estimate_id": execution.estimate_id,
+                    "workload": public_workload(workload),
+                    "estimate": estimate,
+                }
+                await self._set_artifact(project_id, "jungle_grid_estimate", shared)
+                if not estimate_can_submit(estimate):
+                    await self._post(project_id, "Jungle Grid screening blocked submission. No job was submitted.")
+                    await self.project_adapter.stop_project(
+                        project_id=project_id, reason="Jungle Grid screening blocked submission"
+                    )
+                    return
                 await self._post(
                     project_id,
-                    "Jungle Grid estimate is not currently eligible for submission.\n\n"
-                    f"```json\n{json.dumps({'estimate_id': estimate_id, 'workload': shared_workload, 'estimate': shared_estimate}, indent=2)}\n```",
+                    "Jungle Grid estimate ready. No job has been submitted. "
+                    f"Summary: {estimate_summary(estimate)}.\n\n"
+                    f"A human must reply exactly `APPROVE {execution.estimate_id}` "
+                    "before billable compute can start.",
                 )
-                await self.project_adapter.stop_project(
-                    project_id=project_id, reason="Jungle Grid estimate is not eligible for submission"
+            except (ValueError, JungleGridError) as exc:
+                await self._post(
+                    project_id,
+                    f"Jungle Grid estimate failed: {redact_sensitive(exc, [self.jungle_grid.api_key])}",
                 )
-                return
-            await self._post(
-                project_id,
-                "Jungle Grid estimate ready. No job has been submitted.\n\n"
-                f"```json\n{json.dumps({'estimate_id': estimate_id, 'workload': shared_workload, 'estimate': shared_estimate}, indent=2)}\n```\n\n"
-                f"A human must reply exactly `APPROVE {estimate_id}` before billable compute can start.",
-            )
-        except (ValueError, JungleGridError) as exc:
-            await self._post(
-                project_id, f"Jungle Grid estimate failed: {redact_sensitive(exc, self.jungle_grid.api_key)}"
-            )
-            await self.project_adapter.stop_project(project_id=project_id, reason="Jungle Grid estimate failed")
+                await self.project_adapter.stop_project(project_id=project_id, reason="Jungle Grid estimate failed")
     @on_event("project.notification.message_received")
-    async def handle_project_message(self, context: EventContext):
-        """Handle explicit approval and cancellation commands."""
+    async def handle_project_message(self, context: EventContext) -> None:
         payload = context.incoming_event.payload
         project_id = payload.get("project_id")
         sender_id = str(payload.get("sender_id", ""))
-        content = payload.get("content", {})
-        text = content.get("text", "") if isinstance(content, dict) else ""
-        if not project_id or not isinstance(text, str):
+        content = payload.get("content")
+        text = content.get("text") if isinstance(content, Mapping) else None
+        if not isinstance(project_id, str) or not isinstance(text, str):
             return
-        command = text
-        execution = self.executions.get(project_id)
-
-        if command.startswith("APPROVE "):
-            if not execution:
-                await self._post(project_id, "There is no pending Jungle Grid estimate for this project.")
-                return
-            if not self._is_human_approver(sender_id):
-                await self._post(
-                    project_id, "Approval rejected: billable Jungle Grid submission requires a human approver."
-                )
-                return
-            if command != f"APPROVE {execution.estimate_id}":
-                await self._post(project_id, "Approval rejected: estimate id does not match the pending estimate.")
-                return
-            if execution.submission_started:
-                suffix = f" as job `{execution.job_id}`" if execution.job_id else ""
-                await self._post(project_id, f"Jungle Grid submission has already been requested{suffix}.")
-                return
-            await self._submit_and_monitor(execution, sender_id)
+        normalized_prefix = text.strip()
+        if not normalized_prefix.startswith(("APPROVE", "CANCEL")):
             return
+        lock = self.project_locks.setdefault(project_id, asyncio.Lock())
+        async with lock:
+            execution = await self._load_state(project_id)
+            if normalized_prefix.startswith("APPROVE"):
+                await self._handle_approval(project_id, sender_id, text, execution)
+            else:
+                await self._handle_cancellation(project_id, sender_id, text, execution)
+
+    async def _handle_approval(
+        self,
+        project_id: str,
+        sender_id: str,
+        command: str,
+        execution: Optional[ProjectExecution],
+    ) -> None:
+        if not execution:
+            await self._post(project_id, "There is no pending Jungle Grid estimate for this project.")
+            return
+        if not self._is_human(sender_id):
+            await self._post(project_id, "Approval rejected: billable submission requires a verified human identity.")
+            return
+        if command != f"APPROVE {execution.estimate_id}":
+            await self._post(project_id, "Approval rejected: estimate id does not match the pending estimate.")
+            return
+        if execution.terminal or execution.submission_state != "pending":
+            suffix = f" as job `{execution.job_id}`" if execution.job_id else ""
+            await self._post(project_id, f"Jungle Grid submission has already been recorded{suffix}.")
+            return
+        await self._submit(execution, sender_id)
-        if command.startswith("CANCEL "):
-            if not execution or not execution.job_id:
-                await self._post(project_id, "There is no submitted Jungle Grid job to cancel for this project.")
-                return
-            if command != f"CANCEL {execution.job_id}":
-                await self._post(project_id, "Cancellation rejected: job id does not match this project.")
-                return
-            if not self._is_human_approver(sender_id):
-                await self._post(
-                    project_id, "Cancellation rejected: Jungle Grid cancellation requires a human approver."
-                )
-                return
-            try:
-                result = await self.jungle_grid.cancel_job(
-                    execution.job_id, f"Requested from OpenAgents by {sender_id}"
-                )
-                shared_result = self._sanitize_for_project(result, execution)
-                await self._post(
-                    project_id,
-                    f"Cancellation requested for Jungle Grid job `{execution.job_id}`.\n\n```json\n{json.dumps(shared_result, indent=2)}\n```",
-                )
-            except JungleGridError as exc:
-                await self._post(
-                    project_id, f"Jungle Grid cancellation failed: {redact_sensitive(exc, self.jungle_grid.api_key)}"
-                )
-
-    async def _submit_and_monitor(self, execution: ProjectExecution, approved_by: str):
-        execution.submission_started = True
+    async def _submit(self, execution: ProjectExecution, approved_by: str) -> None:
+        execution.submission_state = "submitting"
         execution.approved_by = approved_by
+        await self._save_state(execution)
         try:
-            execution.submit_payload = build_submit_payload(execution.workload)
-            execution.secret_values = _collect_string_values(execution.submit_payload.get("environment"))
-            result = await self.jungle_grid.submit_job(execution.submit_payload)
+            submit_payload, secrets = build_submit_payload(execution.workload)
+            execution.secret_values = secrets
+            result = await self.jungle_grid.submit_job(submit_payload)
             job_id = str(result.get("job_id") or result.get("id") or "").strip()
             if not job_id:
                 raise JungleGridError("INVALID_API_RESPONSE", "Jungle Grid submit response did not include a job id.")
             execution.job_id = job_id
-            execution.last_status = str(result.get("status") or "submitted")
+            execution.submission_state = "submitted"
+            execution.last_status_fingerprint = status_fingerprint(result)
+            await self._save_state(execution)
             await self._set_artifact(
                 execution.project_id,
                 "jungle_grid_submission",
                 {
                     "approved_by": approved_by,
                     "estimate_id": execution.estimate_id,
-                    "submission": self._sanitize_for_project(result, execution),
+                    "submission": self._safe(result, execution),
                 },
             )
             await self._post(
                 execution.project_id,
-                f"Jungle Grid job submitted after approval by `{approved_by}`: `{job_id}` "
-                f"(status: `{lifecycle_label(execution.last_status)}`).",
+                f"Jungle Grid job submitted after approval by `{approved_by}`: `{job_id}`.",
             )
-            task = asyncio.create_task(self._monitor(execution))
-            self.monitor_tasks[execution.project_id] = task
+            self._ensure_monitor(execution)
         except (ValueError, JungleGridError) as exc:
+            execution.submission_state = "submission_failed"
+            await self._save_state(execution)
             await self._post(
                 execution.project_id,
-                f"Jungle Grid submission failed: {redact_sensitive(exc, self.jungle_grid.api_key)}",
+                f"Jungle Grid submission failed: {redact_sensitive(exc, self._secrets(execution))}",
             )
             await self.project_adapter.stop_project(
                 project_id=execution.project_id, reason="Jungle Grid submission failed"
             )
-    async def _monitor(self, execution: ProjectExecution):
+    async def _handle_cancellation(
+        self,
+        project_id: str,
+        sender_id: str,
+        command: str,
+        execution: Optional[ProjectExecution],
+    ) -> None:
+        if not execution or not execution.job_id:
+            await self._post(project_id, "There is no submitted Jungle Grid job to cancel for this project.")
+            return
+        if command != f"CANCEL {execution.job_id}":
+            await self._post(project_id, "Cancellation rejected: job id does not match this project.")
+            return
+        if not self._is_human(sender_id):
+            await self._post(project_id, "Cancellation rejected: cancellation requires a verified human identity.")
+            return
+        if execution.terminal:
+            await self._post(
+                project_id, "Cancellation was not sent because this project already recorded a terminal job."
+            )
+            return
+        if execution.cancel_requested:
+            await self._post(project_id, "Cancellation has already been requested for this job.")
+            return
+        execution.cancel_requested = True
+        await self._save_state(execution)
+        try:
+            result = await self.jungle_grid.cancel_job(execution.job_id, f"Requested from OpenAgents by {sender_id}")
+            await self._post(
+                project_id,
+                f"Cancellation requested for Jungle Grid job `{execution.job_id}`: "
+                f"{json.dumps(self._safe(result, execution), sort_keys=True)}",
+            )
+            if str(result.get("status", "")).lower() in TERMINAL_STATUSES:
+                execution.terminal = True
+                await self._save_state(execution)
+                await self.project_adapter.stop_project(
+                    project_id=project_id, reason=f"Jungle Grid job {execution.job_id} was cancelled."
+                )
+        except JungleGridError as exc:
+            execution.cancel_requested = False
+            await self._save_state(execution)
+            await self._post(
+                project_id,
+                f"Jungle Grid cancellation failed: {redact_sensitive(exc, self._secrets(execution))}",
+            )
+
+    def _ensure_monitor(self, execution: ProjectExecution) -> None:
+        current = self.monitor_tasks.get(execution.project_id)
+        if current and not current.done():
+            return
+        self.monitor_tasks[execution.project_id] = asyncio.create_task(self._monitor(execution))
+
+    async def _monitor(self, execution: ProjectExecution) -> None:
         assert execution.job_id
+        failures = 0
         try:
-            while True:
-                job = await self.jungle_grid.get_job(execution.job_id)
-                status = str(job.get("status") or "unknown")
-                if status != execution.last_status:
-                    execution.last_status = status
+            while not execution.terminal:
+                try:
+                    job = await self.jungle_grid.get_job(execution.job_id)
+                    await self._collect_events(execution)
+                    await self._collect_logs(execution)
+                    failures = 0
+                except JungleGridError as exc:
+                    failures += 1
+                    if failures >= self.max_poll_failures:
+                        raise exc
+                    await self.sleep(self.poll_interval_seconds)
+                    continue
+                fingerprint = status_fingerprint(job)
+                if fingerprint != execution.last_status_fingerprint:
+                    execution.last_status_fingerprint = fingerprint
+                    status = str(job.get("status") or "unknown")
+                    phase = job.get("execution_phase")
+                    delayed = " (delayed start)" if job.get("delayed_start") is True else ""
+                    phase_text = f", phase `{phase}`" if phase else ""
                     await self._post(
                         execution.project_id,
-                        f"Jungle Grid job `{execution.job_id}` is now `{lifecycle_label(status)}`.",
+                        f"Jungle Grid job `{execution.job_id}` is `{status}`{phase_text}{delayed}.",
                     )
-                if status in TERMINAL_STATUSES:
+                await self._save_state(execution)
+                if str(job.get("status", "")).lower() in TERMINAL_STATUSES:
                     await self._finalize(execution, job)
                     return
-                await asyncio.sleep(self.poll_interval_seconds)
+                await self.sleep(self.poll_interval_seconds)
         except JungleGridError as exc:
             await self._post(
                 execution.project_id,
-                f"Jungle Grid monitoring failed: {redact_sensitive(exc, self.jungle_grid.api_key)}",
+                f"Jungle Grid monitoring failed after bounded retries: "
+                f"{redact_sensitive(exc, self._secrets(execution))}",
             )
             await self.project_adapter.stop_project(
                 project_id=execution.project_id, reason="Jungle Grid monitoring failed"
@@ -534,44 +914,73 @@ async def _monitor(self, execution: ProjectExecution):
         finally:
             self.monitor_tasks.pop(execution.project_id, None)
-    async def _finalize(self, execution: ProjectExecution, job: Dict[str, Any]):
+    async def _collect_events(self, execution: ProjectExecution) -> None:
         assert execution.job_id
-        runtime: Dict[str, Any] = {}
-        logs: Dict[str, Any] = {}
-        artifacts: Dict[str, Any] = {}
-        downloads = []
+        response = await self.jungle_grid.get_job_events(execution.job_id)
+        items = response.get("items")
+        if not isinstance(items, list):
+            return
+        seen = set(execution.seen_event_ids)
+        new_items: list[object] = []
+        for item in items:
+            if not isinstance(item, Mapping):
+                continue
+            event_id = str(item.get("id") or item.get("sequence") or item.get("created_at") or "")
+            if not event_id or event_id in seen:
+                continue
+            seen.add(event_id)
+            execution.seen_event_ids.append(event_id)
+            new_items.append(self._safe(item, execution))
+        if new_items:
+            execution.events = (execution.events + new_items)[-MAX_SHARED_EVENTS:]
+            latest = new_items[-1]
+            title = latest.get("title") if isinstance(latest, Mapping) else None
+            if title:
+                await self._post(execution.project_id, f"Jungle Grid lifecycle: {title}.")
+
+    async def _collect_logs(self, execution: ProjectExecution) -> None:
+        assert execution.job_id
+        response = await self.jungle_grid.get_job_logs(execution.job_id, limit=100, cursor=execution.log_cursor)
+        items = response.get("items", response.get("logs"))
+        if isinstance(items, list) and items:
+            safe_items = self._safe(items, execution)
+            if isinstance(safe_items, list):
+                execution.logs = (execution.logs + safe_items)[-MAX_SHARED_LOGS:]
+        next_cursor = response.get("next_cursor")
+        if next_cursor is not None and str(next_cursor) != execution.log_cursor:
+            execution.log_cursor = str(next_cursor)
+
+    async def _finalize(self, execution: ProjectExecution, job: dict[str, object]) -> None:
+        assert execution.job_id
+        runtime: object = {}
+        artifacts: object = {}
         try:
             runtime = await self.jungle_grid.get_job_runtime(execution.job_id)
         except JungleGridError as exc:
-            runtime = {"error": redact_sensitive(exc, self.jungle_grid.api_key)}
-        try:
-            logs = await self.jungle_grid.get_job_logs(execution.job_id)
-        except JungleGridError as exc:
-            logs = {"error": redact_sensitive(exc, self.jungle_grid.api_key)}
+            if exc.status not in {404, 409}:
+                runtime = {"unavailable": redact_sensitive(exc, self._secrets(execution))}
+            else:
+                runtime = {"unavailable": "Runtime details are not available for this job."}
         try:
             artifacts = await self.jungle_grid.list_artifacts(execution.job_id)
-            for artifact in artifacts.get("artifacts", []):
-                if not isinstance(artifact, dict):
-                    continue
-                artifact_id = str(artifact.get("artifact_id") or artifact.get("id") or "").strip()
-                if artifact_id:
-                    download = await self.jungle_grid.get_artifact(execution.job_id, artifact_id)
-                    if "url" in download:
-                        download = {**download, "url": "[REDACTED]"}
-                    downloads.append(download)
         except JungleGridError as exc:
-            artifacts = {"error": redact_sensitive(exc, self.jungle_grid.api_key)}
-
-        result = self._sanitize_for_project(
-            {"job": job, "runtime": runtime, "logs": logs, "artifacts": artifacts, "downloads": downloads},
-            execution,
-        )
+            artifacts = {"unavailable": redact_sensitive(exc, self._secrets(execution))}
+        result = {
+            "job": self._safe(job, execution),
+            "events": execution.events,
+            "logs": execution.logs,
+            "runtime": self._safe(runtime, execution),
+            "artifacts": self._safe(artifacts, execution),
+        }
         await self._set_artifact(execution.project_id, "jungle_grid_result", result)
-        status = str(job.get("status") or "unknown")
+        execution.terminal = True
+        await self._save_state(execution)
+        status = str(job.get("status") or "unknown").lower()
         await self._post(
             execution.project_id,
             f"Jungle Grid job `{execution.job_id}` finished with status `{status}`. "
-            "Logs and artifact metadata are stored in project artifact `jungle_grid_result`.",
+            "Sanitized lifecycle events, polled logs, runtime details, and artifact metadata are in "
+            "`jungle_grid_result`. Temporary download URLs are intentionally not requested or stored.",
         )
         if status == "completed":
             await self.project_adapter.complete_project(
@@ -585,10 +994,12 @@ async def _finalize(self, execution: ProjectExecution, job: Dict[str, Any]):
             )
-async def main():
-    """Run the Jungle Grid executor agent."""
+async def main() -> None:
     logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
-    agent = JungleGridExecutorAgent()
+    agent = JungleGridExecutorAgent(
+        poll_interval_seconds=float(os.getenv("JUNGLE_GRID_POLL_INTERVAL_SECONDS", "10")),
+        max_poll_failures=int(os.getenv("JUNGLE_GRID_MAX_POLL_FAILURES", "3")),
+    )
     try:
         await agent.async_start(
             network_host="localhost",
diff --git a/sdk/demos/09_jungle_grid_gpu_execution/network.yaml b/sdk/demos/09_jungle_grid_gpu_execution/network.yaml
index 30c0f5012..8bfb1f445 100644
--- a/sdk/demos/09_jungle_grid_gpu_execution/network.yaml
+++ b/sdk/demos/09_jungle_grid_gpu_execution/network.yaml
@@ -20,6 +20,8 @@ network:
   agent_groups:
     executors:
       description: Agents allowed to execute Jungle Grid project workflows
+      # Demo-only group credential used to establish runtime topology membership.
+      # Replace it before adapting this network for a shared or public deployment.
       password_hash: 8fba13dab71d6fdd8a9b9db1f06e81315dfbfd69167b6097f724604db3c91cdf
       metadata:
         permissions:
@@ -39,7 +41,7 @@ network:
           description: Estimate, approve, execute, and monitor an AI workload on Jungle Grid
           expose_as_tool: true
           tool_name: run_jungle_grid_workload
-          tool_description: Start a Jungle Grid workload project. The task must be a JSON object with name, workload_type, image, and model_size_gb; use environment_from_env for workload environment values.
+          tool_description: Start a human-approved Jungle Grid workload project. The task must be a JSON object with name, workload_type, and image; use uploaded input_id references and environment_from_env for secret workload values.
tool_mode: async agent_groups: - executors @@ -48,7 +50,8 @@ network: The executor estimates cost first and will not submit a job until a human replies with the exact approval command shown in the project. Do not put credentials in the goal; use environment_from_env to reference variables - available only in the executor process. + available only in the executor process. File jobs accept previously + uploaded Jungle Grid input_id references and never read arbitrary host paths. created_by_version: 0.9.3 network_profile: From b14a0f361a565707d9b5cd74e2e3e21c3b9435c9 Mon Sep 17 00:00:00 2001 From: dejaguarkyng Date: Thu, 11 Jun 2026 15:18:07 +0000 Subject: [PATCH 4/5] test: expand jungle grid execution safety coverage --- tests/agents/test_jungle_grid_executor.py | 791 ++++++++++++++-------- 1 file changed, 494 insertions(+), 297 deletions(-) diff --git a/tests/agents/test_jungle_grid_executor.py b/tests/agents/test_jungle_grid_executor.py index aa9bf248e..b3466bc7f 100644 --- a/tests/agents/test_jungle_grid_executor.py +++ b/tests/agents/test_jungle_grid_executor.py @@ -1,8 +1,9 @@ -"""Mocked tests for the Jungle Grid GPU execution demo agent.""" +"""Mocked safety and contract tests for the Jungle Grid execution demo.""" import asyncio import importlib.util import json +import sys from pathlib import Path from types import SimpleNamespace from unittest.mock import AsyncMock @@ -29,6 +30,7 @@ SPEC = importlib.util.spec_from_file_location("jungle_grid_executor", MODULE_PATH) MODULE = importlib.util.module_from_spec(SPEC) assert SPEC and SPEC.loader +sys.modules[SPEC.name] = MODULE SPEC.loader.exec_module(MODULE) JungleGridClient = MODULE.JungleGridClient @@ -36,10 +38,10 @@ JungleGridExecutorAgent = MODULE.JungleGridExecutorAgent ProjectExecution = MODULE.ProjectExecution EXECUTORS_GROUP_PASSWORD_HASH = MODULE.EXECUTORS_GROUP_PASSWORD_HASH +STATE_ARTIFACT = MODULE.STATE_ARTIFACT build_estimate_payload = MODULE.build_estimate_payload build_submit_payload = 
MODULE.build_submit_payload estimate_can_submit = MODULE.estimate_can_submit -lifecycle_label = MODULE.lifecycle_label parse_workload_goal = MODULE.parse_workload_goal public_workload = MODULE.public_workload redact_sensitive = MODULE.redact_sensitive @@ -54,50 +56,88 @@ def context(event_name, payload): ) -def workload(): - return { - "name": "batch-demo", - "workload_type": "batch", - "image": "python:3.11-slim", +def workload(**updates): + value = { + "name": "training-demo", + "workload_type": "training", + "image": "pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime", + "command": ["python", "-c", "print(42)"], "model_size_gb": 1, - "command": "python", - "args": ["-c", "print(42)"], - "optimize_for": "cost", + "routing_mode": "cost", } + value.update(updates) + return value class FakeJungleGridClient: def __init__(self): - self.api_key = "test-api-key" - self.estimate_job = AsyncMock(return_value={"available": True, "estimated_cost_usd": {"min": 0.1, "max": 0.2}}) + self.api_key = "jg_test_api_key" + self.estimate_job = AsyncMock( + return_value={ + "available": True, + "screening": {"can_submit": True}, + "capacity_status": {"immediate_capacity_confirmed": False}, + } + ) self.submit_job = AsyncMock(return_value={"job_id": "job_123", "status": "queued"}) self.get_job = AsyncMock(return_value={"job_id": "job_123", "status": "completed"}) + self.get_job_events = AsyncMock( + return_value={ + "items": [ + { + "id": "evt_1", + "type": "job.completed", + "title": "Job completed", + "message": "done", + "created_at": "2026-06-11T00:00:00Z", + } + ] + } + ) + self.get_job_logs = AsyncMock( + return_value={ + "items": [{"category": "workload_stdout", "message": "done"}], + "next_cursor": None, + } + ) self.get_job_runtime = AsyncMock(return_value={"exit_code": 0, "stdout_tail": "done"}) - self.get_job_logs = AsyncMock(return_value={"items": [{"message": "done"}]}) - self.cancel_job = AsyncMock(return_value={"job_id": "job_123", "status": "cancelled", "cancelled": True}) 
+ self.cancel_job = AsyncMock(return_value={"job_id": "job_123", "status": "cancelled"}) self.list_artifacts = AsyncMock( - return_value={"artifacts": [{"artifact_id": "artifact_1", "filename": "output.json"}]} - ) - self.get_artifact = AsyncMock( return_value={ - "artifact": {"artifact_id": "artifact_1", "filename": "output.json"}, - "url": "https://example.test/file", + "artifacts": [ + { + "artifact_id": "artifact_1", + "filename": "output.json", + "content_type": "application/json", + "size_bytes": 12, + } + ] } ) + self.get_artifact = AsyncMock(return_value={"download_url": "https://storage.example/file?signature=secret"}) def agent_with_mocks(fake=None): - agent = JungleGridExecutorAgent(jungle_grid_client=fake or FakeJungleGridClient(), poll_interval_seconds=0) + agent = JungleGridExecutorAgent( + jungle_grid_client=fake or FakeJungleGridClient(), + poll_interval_seconds=0, + sleep=AsyncMock(), + ) agent.project_adapter = AsyncMock() agent.project_adapter.send_project_message = AsyncMock(return_value={"success": True}) agent.project_adapter.set_project_artifact = AsyncMock(return_value={"success": True}) + agent.project_adapter.get_project_artifact = AsyncMock(return_value={"success": True, "data": {"value": None}}) agent.project_adapter.complete_project = AsyncMock(return_value={"success": True}) agent.project_adapter.stop_project = AsyncMock(return_value={"success": True}) return agent +def message_texts(agent): + return [call.kwargs["content"]["text"] for call in agent.project_adapter.send_project_message.await_args_list] + + @pytest.mark.asyncio -async def test_executor_group_membership_delivers_project_start_and_returns_estimate(): +async def test_group_authentication_runtime_membership_and_project_delivery(): network_yaml = yaml.safe_load(NETWORK_CONFIG_PATH.read_text()) executor_group = network_yaml["network"]["agent_groups"]["executors"] assert executor_group["password_hash"] == EXECUTORS_GROUP_PASSWORD_HASH @@ -137,17 +177,11 @@ async def 
test_executor_group_membership_delivers_project_start_and_returns_esti fake = FakeJungleGridClient() executor = agent_with_mocks(fake) - delivered = [] async def deliver(event): - delivered.append(event) if event.destination_id == "jungle-grid-executor": await executor.handle_project_started( - EventContext( - incoming_event=event, - event_threads={}, - incoming_thread_id="project-start", - ) + EventContext(incoming_event=event, event_threads={}, incoming_thread_id="start") ) return SimpleNamespace(success=True) @@ -155,7 +189,7 @@ async def deliver(event): response = await project_mod.process_system_message( Event( event_name="project.start", - source_id="human:project-owner", + source_id="human:owner", payload={ "template_id": "jungle_grid_execution", "goal": json.dumps(workload()), @@ -163,364 +197,478 @@ async def deliver(event): }, ) ) - assert response.success assert "jungle-grid-executor" in response.data["authorized_agents"] - assert any( - event.event_name == "project.notification.started" - and event.destination_id == "jungle-grid-executor" - and event.payload["initiator_agent_id"] == "human:project-owner" - for event in delivered - ) - fake.estimate_job.assert_awaited_once_with(build_estimate_payload(workload())) - estimate_message = executor.project_adapter.send_project_message.await_args.kwargs["content"]["text"] - assert "Jungle Grid estimate ready" in estimate_message - assert "APPROVE" in estimate_message + fake.estimate_job.assert_awaited_once() @pytest.mark.asyncio -async def test_successful_estimate_flow_posts_estimate_and_requires_approval(): +async def test_estimate_never_submits_and_requires_exact_human_approval(): fake = FakeJungleGridClient() agent = agent_with_mocks(fake) - await agent.handle_project_started( context("project.notification.started", {"project_id": "project-1", "goal": json.dumps(workload())}) ) - - fake.estimate_job.assert_awaited_once_with(build_estimate_payload(workload())) fake.submit_job.assert_not_awaited() - assert 
"project-1" in agent.executions - message = agent.project_adapter.send_project_message.await_args.kwargs["content"]["text"] - assert "No job has been submitted" in message - assert "APPROVE" in message + assert any("No job has been submitted" in text and "APPROVE" in text for text in message_texts(agent)) @pytest.mark.asyncio -async def test_unavailable_estimate_never_requests_approval_or_submits(): +async def test_screening_can_submit_false_blocks_approval(): fake = FakeJungleGridClient() - fake.estimate_job = AsyncMock(return_value={"available": False, "can_submit": False}) + fake.estimate_job.return_value = { + "available": True, + "screening": {"can_submit": False, "blocked_checks": ["resource"]}, + } agent = agent_with_mocks(fake) - await agent.handle_project_started( context("project.notification.started", {"project_id": "project-1", "goal": json.dumps(workload())}) ) - fake.submit_job.assert_not_awaited() - message = agent.project_adapter.send_project_message.await_args.kwargs["content"]["text"] - assert "not currently eligible for submission" in message - assert "APPROVE" not in message agent.project_adapter.stop_project.assert_awaited_once() + assert not any("APPROVE" in text for text in message_texts(agent)) @pytest.mark.asyncio -async def test_approval_required_before_submit_and_non_human_is_rejected(): +@pytest.mark.parametrize( + ("sender", "command"), + [ + ("agent:other", "APPROVE estimate-1"), + ("human:user", "APPROVE wrong"), + ("human:user", " APPROVE estimate-1"), + ("human:user", "APPROVE estimate-1\n"), + ], +) +async def test_unauthorized_or_malformed_approval_is_rejected(sender, command): fake = FakeJungleGridClient() agent = agent_with_mocks(fake) - execution = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) - agent.executions["project-1"] = execution - + agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) await agent.handle_project_message( context( 
"project.notification.message_received", - {"project_id": "project-1", "sender_id": "agent:other", "content": {"text": "APPROVE estimate-1"}}, + {"project_id": "project-1", "sender_id": sender, "content": {"text": command}}, ) ) - fake.submit_job.assert_not_awaited() - assert ( - "requires a human approver" in agent.project_adapter.send_project_message.await_args.kwargs["content"]["text"] - ) @pytest.mark.asyncio -@pytest.mark.parametrize("command", ["APPROVE estimate-2", " APPROVE estimate-1", "APPROVE estimate-1\n"]) -async def test_approval_requires_exact_command(command): +async def test_duplicate_and_concurrent_approval_submit_only_once(): fake = FakeJungleGridClient() - agent = agent_with_mocks(fake) - execution = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) - agent.executions["project-1"] = execution + started = asyncio.Event() + release = asyncio.Event() - await agent.handle_project_message( - context( - "project.notification.message_received", - {"project_id": "project-1", "sender_id": "human:user", "content": {"text": command}}, - ) + async def delayed_submit(_payload): + started.set() + await release.wait() + return {"job_id": "job_123", "status": "queued"} + + fake.submit_job.side_effect = delayed_submit + agent = agent_with_mocks(fake) + agent._ensure_monitor = lambda execution: None + agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) + approval = context( + "project.notification.message_received", + { + "project_id": "project-1", + "sender_id": "human:user", + "content": {"text": "APPROVE estimate-1"}, + }, ) + first = asyncio.create_task(agent.handle_project_message(approval)) + await started.wait() + second = asyncio.create_task(agent.handle_project_message(approval)) + release.set() + await asyncio.gather(first, second) + fake.submit_job.assert_awaited_once() + +@pytest.mark.asyncio +async def test_restart_recovers_submitted_state_without_resubmitting(): + 
fake = FakeJungleGridClient() + agent = agent_with_mocks(fake) + persisted = ProjectExecution( + "project-1", + workload(), + "estimate-1", + {"available": True}, + job_id="job_existing", + submission_state="submitted", + ) + agent.project_adapter.get_project_artifact.return_value = { + "success": True, + "data": {"value": json.dumps(persisted.persisted())}, + } + agent._ensure_monitor = AsyncMock() + await agent.handle_project_started( + context("project.notification.started", {"project_id": "project-1", "goal": json.dumps(workload())}) + ) + fake.estimate_job.assert_not_awaited() fake.submit_job.assert_not_awaited() + agent._ensure_monitor.assert_called_once() @pytest.mark.asyncio -async def test_approved_submit_flow_starts_monitor(): +async def test_restart_does_not_retry_uncertain_submission(): fake = FakeJungleGridClient() agent = agent_with_mocks(fake) - execution = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) - agent.executions["project-1"] = execution - agent._monitor = AsyncMock() - + persisted = ProjectExecution( + "project-1", + workload(), + "estimate-1", + {"available": True}, + submission_state="submitting", + ) + agent.project_adapter.get_project_artifact.return_value = { + "success": True, + "data": {"value": json.dumps(persisted.persisted())}, + } await agent.handle_project_message( context( "project.notification.message_received", - {"project_id": "project-1", "sender_id": "human:user", "content": {"text": "APPROVE estimate-1"}}, + { + "project_id": "project-1", + "sender_id": "human:user", + "content": {"text": "APPROVE estimate-1"}, + }, ) ) - await asyncio.sleep(0) + fake.submit_job.assert_not_awaited() - fake.submit_job.assert_awaited_once_with(workload()) - assert execution.job_id == "job_123" - agent._monitor.assert_awaited_once_with(execution) +def test_current_command_array_is_preserved(): + requested = parse_workload_goal(json.dumps(workload())) + assert build_estimate_payload(requested)["command"] == 
["python", "-c", "print(42)"] + assert build_submit_payload(requested)[0]["command"] == ["python", "-c", "print(42)"] -@pytest.mark.asyncio -async def test_concurrent_matching_approvals_submit_only_once(): - fake = FakeJungleGridClient() - submit_started = asyncio.Event() - release_submit = asyncio.Event() - async def delayed_submit(_workload): - submit_started.set() - await release_submit.wait() - return {"job_id": "job_123", "status": "queued"} +def test_legacy_command_and_args_are_combined_without_semantic_change(): + requested = parse_workload_goal(json.dumps(workload(command="python", args=["-c", "print(42)"]))) + assert build_submit_payload(requested)[0]["command"] == ["python", "-c", "print(42)"] + assert "args" not in build_submit_payload(requested)[0] - fake.submit_job = AsyncMock(side_effect=delayed_submit) - agent = agent_with_mocks(fake) - agent._monitor = AsyncMock() - execution = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) - agent.executions["project-1"] = execution - approval = context( - "project.notification.message_received", - {"project_id": "project-1", "sender_id": "human:user", "content": {"text": "APPROVE estimate-1"}}, + +def test_command_array_rejects_separate_args(): + with pytest.raises(ValueError, match="cannot be combined"): + parse_workload_goal(json.dumps(workload(args=["extra"]))) + + +def test_fine_tuning_is_accepted_and_normalized(): + requested = parse_workload_goal(json.dumps(workload(workload_type="fine_tuning"))) + assert build_submit_payload(requested)[0]["workload_type"] == "fine-tuning" + + +def test_invalid_workload_type_is_rejected(): + with pytest.raises(ValueError, match="workload_type must be one of"): + parse_workload_goal(json.dumps(workload(workload_type="interactive"))) + + +def test_input_script_and_expected_artifacts_are_forwarded(): + requested = parse_workload_goal( + json.dumps( + workload( + input_files=[{"input_id": "inp_audio123"}], + script_files=["inp_script123"], + 
expected_artifacts=["/workspace/artifacts/transcript.txt"], + ) + ) ) + payload = build_submit_payload(requested)[0] + assert payload["input_files"] == [{"input_id": "inp_audio123"}] + assert payload["script_files"] == [{"input_id": "inp_script123"}] + assert payload["expected_artifacts"] == ["/workspace/artifacts/transcript.txt"] - first = asyncio.create_task(agent.handle_project_message(approval)) - await submit_started.wait() - await agent.handle_project_message(approval) - release_submit.set() - await first - await asyncio.sleep(0) - fake.submit_job.assert_awaited_once_with(workload()) +@pytest.mark.parametrize( + "bad", + [ + {"input_files": [{"local_path": "/etc/passwd"}]}, + {"script_files": [{"input_id": "../../secret"}]}, + {"expected_artifacts": ["/tmp/output.txt"]}, + ], +) +def test_arbitrary_local_paths_and_invalid_references_are_rejected(bad): + with pytest.raises(ValueError): + parse_workload_goal(json.dumps(workload(**bad))) + + +def test_environment_references_resolve_only_for_submission(monkeypatch): + monkeypatch.setenv("MODEL_TOKEN", "secret-value") + requested = parse_workload_goal(json.dumps(workload(environment_from_env={"MODEL_TOKEN": "MODEL_TOKEN"}))) + assert "environment" not in build_estimate_payload(requested) + payload, secrets = build_submit_payload(requested) + assert payload["environment"] == {"MODEL_TOKEN": "secret-value"} + assert secrets == ["secret-value"] + + +def test_missing_environment_reference_blocks_submission(monkeypatch): + monkeypatch.delenv("MISSING_TOKEN", raising=False) + requested = parse_workload_goal(json.dumps(workload(environment_from_env={"MODEL_TOKEN": "MISSING_TOKEN"}))) + with pytest.raises(ValueError, match="MISSING_TOKEN"): + build_submit_payload(requested) + + +def test_callback_auth_token_is_environment_only(monkeypatch): + monkeypatch.setenv("CALLBACK_TOKEN", "callback-secret") + requested = parse_workload_goal( + json.dumps( + workload( + callback={ + "url": "https://example.test/hooks/jungle", + 
"metadata": {"project": "demo"}, + "auth_token_from_env": "CALLBACK_TOKEN", + } + ) + ) + ) + estimate = build_estimate_payload(requested) + assert "auth_token" not in json.dumps(estimate) + payload, secrets = build_submit_payload(requested) + assert payload["callback"]["auth_token"] == "callback-secret" + assert secrets == ["callback-secret"] + + +def test_literal_secrets_and_secret_metadata_are_rejected(): + with pytest.raises(ValueError, match="must not contain"): + parse_workload_goal(json.dumps(workload(command=["curl", "-H", "Bearer secret"]))) + with pytest.raises(ValueError, match="secret-like"): + parse_workload_goal(json.dumps(workload(metadata={"api_token": "value"}))) + + +def test_supported_resource_routing_and_timeout_fields_are_forwarded(): + requested = parse_workload_goal( + json.dumps( + workload( + gpu_required=True, + gpu_count=1, + gpu_class="datacenter", + gpu_type="A100", + min_vram_gb=40, + region_preference="us-east", + region_mode="strict", + timeout_seconds=600, + precision="bf16", + disk_gb=50, + ) + ) + ) + payload = build_submit_payload(requested)[0] + assert payload["gpu_required"] is True + assert payload["gpu_type"] == "A100" + assert payload["timeout_seconds"] == 600 + + +def test_constraints_reject_unverified_fields(): + with pytest.raises(ValueError, match="Unsupported constraint fields"): + parse_workload_goal(json.dumps(workload(constraints={"provider": "runpod"}))) @pytest.mark.asyncio -async def test_status_polling_posts_updates_and_completes(): - fake = FakeJungleGridClient() - fake.get_job = AsyncMock( - side_effect=[ - {"job_id": "job_123", "status": "running"}, - {"job_id": "job_123", "status": "completed"}, - ] +async def test_malformed_approval_posts_rejection(): + agent = agent_with_mocks() + agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {"available": True}) + await agent.handle_project_message( + context( + "project.notification.message_received", + { + "project_id": 
"project-1", + "sender_id": "human:user", + "content": {"text": " APPROVE estimate-1"}, + }, + ) ) - agent = agent_with_mocks(fake) - execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123", last_status="queued") + assert any("Approval rejected" in text for text in message_texts(agent)) - await agent._monitor(execution) - texts = [call.kwargs["content"]["text"] for call in agent.project_adapter.send_project_message.await_args_list] - assert any("`running`" in text for text in texts) - assert any("`completed`" in text for text in texts) - agent.project_adapter.complete_project.assert_awaited_once() +def test_estimate_can_submit_honors_screening_and_availability(): + assert estimate_can_submit({"available": True, "screening": {"can_submit": True}}) + assert not estimate_can_submit({"available": False}) + assert not estimate_can_submit({"screening": {"can_submit": False}}) @pytest.mark.asyncio -async def test_failed_workload_stops_project(): +async def test_status_changes_are_deduplicated(): fake = FakeJungleGridClient() + running = { + "job_id": "job_123", + "status": "running", + "execution_phase": "executing", + "phase_started_at": "2026-06-11T00:00:00Z", + } + fake.get_job.side_effect = [running, running, {"job_id": "job_123", "status": "completed"}] agent = agent_with_mocks(fake) - execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + execution = ProjectExecution( + "project-1", workload(), "estimate-1", {}, job_id="job_123", submission_state="submitted" + ) + await agent._monitor(execution) + assert sum("`running`" in text for text in message_texts(agent)) == 1 - await agent._finalize(execution, {"job_id": "job_123", "status": "failed"}) - agent.project_adapter.stop_project.assert_awaited_once() - agent.project_adapter.complete_project.assert_not_awaited() +@pytest.mark.asyncio +async def test_lifecycle_endpoint_and_event_deduplication(): + fake = FakeJungleGridClient() + agent = 
agent_with_mocks(fake) + execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + await agent._collect_events(execution) + await agent._collect_events(execution) + fake.get_job_events.assert_awaited_with("job_123") + assert len(execution.events) == 1 + assert sum("Job completed" in text for text in message_texts(agent)) == 1 @pytest.mark.asyncio -async def test_logs_and_artifacts_are_stored_in_project_artifact(): +async def test_empty_workload_logs_during_startup_do_not_fail(): fake = FakeJungleGridClient() + fake.get_job_logs.return_value = {"items": [], "next_cursor": None} agent = agent_with_mocks(fake) execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + await agent._collect_logs(execution) + assert execution.logs == [] + agent.project_adapter.stop_project.assert_not_awaited() - await agent._finalize(execution, {"job_id": "job_123", "status": "completed"}) - fake.get_job_runtime.assert_awaited_once_with("job_123") - fake.get_job_logs.assert_awaited_once_with("job_123") - fake.list_artifacts.assert_awaited_once_with("job_123") - fake.get_artifact.assert_awaited_once_with("job_123", "artifact_1") - artifact_call = agent.project_adapter.set_project_artifact.await_args - assert artifact_call.kwargs["key"] == "jungle_grid_result" - assert "output.json" in artifact_call.kwargs["value"] - assert "stdout_tail" in artifact_call.kwargs["value"] - assert "https://example.test/file" not in artifact_call.kwargs["value"] - assert "[REDACTED]" in artifact_call.kwargs["value"] +@pytest.mark.asyncio +async def test_log_pagination_and_bounded_output(): + fake = FakeJungleGridClient() + fake.get_job_logs.side_effect = [ + {"items": [{"message": "first"}], "next_cursor": "cursor-1"}, + {"items": [{"message": f"line-{index}"} for index in range(250)], "next_cursor": None}, + ] + agent = agent_with_mocks(fake) + execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + await 
agent._collect_logs(execution) + await agent._collect_logs(execution) + assert fake.get_job_logs.await_args_list[1].kwargs["cursor"] == "cursor-1" + assert len(execution.logs) == 200 @pytest.mark.asyncio -async def test_resolved_environment_values_are_redacted_from_results(monkeypatch): - monkeypatch.setenv("MODEL_TOKEN", "secret-value") +async def test_runtime_unavailable_is_nonfatal_and_artifacts_have_no_signed_url(): fake = FakeJungleGridClient() - fake.get_job_logs = AsyncMock(return_value={"items": [{"message": "token=secret-value"}]}) + fake.get_job_runtime.side_effect = JungleGridError("NOT_FOUND", "not ready", 404) + fake.list_artifacts.return_value = { + "artifacts": [ + { + "artifact_id": "artifact_1", + "filename": "output.json", + "download_url": "https://storage.example/file?signature=secret", + } + ] + } agent = agent_with_mocks(fake) - requested = {**workload(), "environment_from_env": {"MODEL_TOKEN": "MODEL_TOKEN"}} - execution = ProjectExecution( - "project-1", - requested, - "estimate-1", - {}, - job_id="job_123", - submit_payload=build_submit_payload(requested), - secret_values=["secret-value"], + execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + await agent._finalize(execution, {"job_id": "job_123", "status": "completed"}) + result_call = next( + call + for call in agent.project_adapter.set_project_artifact.await_args_list + if call.kwargs["key"] == "jungle_grid_result" ) + value = result_call.kwargs["value"] + assert "Runtime details are not available" in value + assert "https://storage.example" not in value + assert "signature=secret" not in value + fake.get_artifact.assert_not_awaited() + agent.project_adapter.complete_project.assert_awaited_once() - await agent._finalize(execution, {"job_id": "job_123", "status": "completed"}) - artifact_value = agent.project_adapter.set_project_artifact.await_args.kwargs["value"] - assert "secret-value" not in artifact_value - assert "[REDACTED]" in artifact_value 
+@pytest.mark.asyncio +@pytest.mark.parametrize("status", ["failed", "cancelled"]) +async def test_failed_or_cancelled_job_stops_project(status): + agent = agent_with_mocks() + execution = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") + await agent._finalize(execution, {"job_id": "job_123", "status": status}) + agent.project_adapter.stop_project.assert_awaited_once() + agent.project_adapter.complete_project.assert_not_awaited() @pytest.mark.asyncio -async def test_cancellation_uses_matching_job_id(): +@pytest.mark.parametrize( + ("sender", "command"), + [ + ("agent:other", "CANCEL job_123"), + ("human:user", "CANCEL job_other"), + ("human:user", " CANCEL job_123"), + ("human:user", "CANCEL job_123\n"), + ], +) +async def test_unauthorized_mismatched_or_malformed_cancellation_is_rejected(sender, command): fake = FakeJungleGridClient() agent = agent_with_mocks(fake) agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") - await agent.handle_project_message( context( "project.notification.message_received", - {"project_id": "project-1", "sender_id": "human:user", "content": {"text": "CANCEL job_123"}}, + {"project_id": "project-1", "sender_id": sender, "content": {"text": command}}, ) ) - - fake.cancel_job.assert_awaited_once_with("job_123", "Requested from OpenAgents by human:user") + fake.cancel_job.assert_not_awaited() @pytest.mark.asyncio -async def test_non_human_cancellation_is_rejected(): +async def test_duplicate_and_terminal_cancellation_are_safe(): fake = FakeJungleGridClient() agent = agent_with_mocks(fake) - agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") - - await agent.handle_project_message( - context( - "project.notification.message_received", - {"project_id": "project-1", "sender_id": "agent:other", "content": {"text": "CANCEL job_123"}}, - ) + execution = ProjectExecution("project-1", workload(), 
"estimate-1", {}, job_id="job_123", cancel_requested=True) + agent.executions["project-1"] = execution + cancellation = context( + "project.notification.message_received", + { + "project_id": "project-1", + "sender_id": "human:user", + "content": {"text": "CANCEL job_123"}, + }, ) - + await agent.handle_project_message(cancellation) + execution.cancel_requested = False + execution.terminal = True + await agent.handle_project_message(cancellation) fake.cancel_job.assert_not_awaited() @pytest.mark.asyncio -@pytest.mark.parametrize("command", ["CANCEL job_456", " CANCEL job_123", "CANCEL job_123\n"]) -async def test_cancellation_requires_exact_command(command): +async def test_matching_human_cancellation_uses_recorded_job_only(): fake = FakeJungleGridClient() agent = agent_with_mocks(fake) agent.executions["project-1"] = ProjectExecution("project-1", workload(), "estimate-1", {}, job_id="job_123") - await agent.handle_project_message( context( "project.notification.message_received", - {"project_id": "project-1", "sender_id": "human:user", "content": {"text": command}}, + { + "project_id": "project-1", + "sender_id": "human:user", + "content": {"text": "CANCEL job_123"}, + }, ) ) - - fake.cancel_job.assert_not_awaited() - - -@pytest.mark.asyncio -async def test_missing_api_key_is_reported_without_network_call(monkeypatch): - monkeypatch.delenv("JUNGLE_GRID_API_KEY", raising=False) - client = JungleGridClient() - with pytest.raises(JungleGridError, match="JUNGLE_GRID_API_KEY is required"): - await client.estimate_job(workload()) - - -def test_invalid_workload_is_rejected(): - with pytest.raises(ValueError, match="Missing required workload fields"): - parse_workload_goal('{"workload_type": "batch"}') - - -def test_workload_requires_positive_model_size(): - with pytest.raises(ValueError, match="model_size_gb"): - parse_workload_goal(json.dumps({**workload(), "model_size_gb": 0})) - - -def test_estimate_payload_matches_current_draft_job_fields(): - requested = { - 
**workload(), - "constraints": {"max_price_per_hour": 2.5, "preferred_gpu_family": "l4"}, - } - - assert build_estimate_payload(requested) == requested - - -def test_workload_rejects_literal_credentials_and_secret_like_metadata(): - with pytest.raises(ValueError, match="must not contain API keys"): - parse_workload_goal(json.dumps({**workload(), "command": "curl -H 'Bearer secret-value'"})) - with pytest.raises(ValueError, match="secret-like keys"): - parse_workload_goal(json.dumps({**workload(), "metadata": {"api_token": "secret-value"}})) - - -def test_build_submit_payload_resolves_environment_only_at_submission(monkeypatch): - monkeypatch.setenv("MODEL_TOKEN", "secret-value") - requested = {**workload(), "environment_from_env": {"MODEL_TOKEN": "MODEL_TOKEN"}} - - assert "environment_from_env" not in build_estimate_payload(requested) - assert build_submit_payload(requested)["environment"] == {"MODEL_TOKEN": "secret-value"} - assert public_workload(requested)["environment_from_env"] == {"MODEL_TOKEN": "MODEL_TOKEN"} - - -def test_build_submit_payload_rejects_missing_local_environment(monkeypatch): - monkeypatch.delenv("MISSING_MODEL_TOKEN", raising=False) - requested = {**workload(), "environment_from_env": {"MODEL_TOKEN": "MISSING_MODEL_TOKEN"}} - - with pytest.raises(ValueError, match="MISSING_MODEL_TOKEN"): - build_submit_payload(requested) - - -def test_secret_redaction_removes_api_keys_and_bearer_tokens(): - text = redact_sensitive("failed with Bearer abc123 and jg_super_secret", "jg_super_secret") - assert "abc123" not in text - assert "jg_super_secret" not in text - assert "[REDACTED]" in text - - -def test_public_workload_redacts_metadata_values(): - shared = public_workload({**workload(), "metadata": {"nested": {"value": "secret"}}}) - assert shared["metadata"] == {"nested": "[REDACTED]"} - assert "secret" not in json.dumps(shared) + fake.cancel_job.assert_awaited_once_with("job_123", "Requested from OpenAgents by human:user") -def 
test_project_data_redaction_removes_nested_workload_secrets():
-    result = sanitize_project_data(
-        {"logs": [{"message": "token=secret-value"}], "error": "Bearer test-api-key"},
-        ["secret-value", "test-api-key"],
+def test_redaction_removes_api_keys_environment_values_and_signed_urls():
+    safe = sanitize_project_data(
+        {
+            "message": "Bearer jg_test_api_key secret-value",
+            "download_url": "https://storage.example/file?signature=abc",
+            "authorization": "Bearer abc",
+        },
+        ["jg_test_api_key", "secret-value"],
     )
-    assert "secret-value" not in json.dumps(result)
-    assert "test-api-key" not in json.dumps(result)
+    encoded = json.dumps(safe)
+    assert "jg_test_api_key" not in encoded
+    assert "secret-value" not in encoded
+    assert "storage.example" not in encoded
+    assert encoded.count("[REDACTED]") >= 3


-def test_estimate_can_submit_honors_explicit_unavailability():
-    assert estimate_can_submit({"available": True, "can_submit": True})
-    assert not estimate_can_submit({"available": False})
-    assert not estimate_can_submit({"can_submit": False})
+def test_public_workload_hides_metadata_values():
+    shared = public_workload(workload(metadata={"customer": "private-value"}))
+    assert shared["metadata"] == {"customer": "[REDACTED]"}


-@pytest.mark.parametrize(
-    ("status", "label"),
-    [
-        ("submitted", "submitted"),
-        ("queued", "queued"),
-        ("assigned", "assigned (provisioning)"),
-        ("running", "running"),
-        ("completed", "completed"),
-        ("failed", "failed"),
-        ("rejected", "rejected"),
-        ("cancelled", "cancelled"),
-    ],
-)
-def test_lifecycle_labels(status, label):
-    assert lifecycle_label(status) == label
+@pytest.mark.asyncio
+async def test_missing_api_key_fails_before_network(monkeypatch):
+    monkeypatch.delenv("JUNGLE_GRID_API_KEY", raising=False)
+    with pytest.raises(JungleGridError, match="JUNGLE_GRID_API_KEY is required"):
+        await JungleGridClient().estimate_job(workload())


 class FakeResponse:
@@ -556,67 +704,116 @@ async def __aexit__(self, exc_type, exc, tb):
@pytest.mark.asyncio
-async def test_invalid_jungle_grid_response(monkeypatch):
-    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key")
-    monkeypatch.setattr(MODULE.aiohttp, "ClientSession", lambda **kwargs: FakeSession(FakeResponse(200, "not-json")))
-    client = JungleGridClient()
-
-    with pytest.raises(JungleGridError, match="invalid JSON"):
+async def test_timeout_uses_bounded_retries_for_reads_only(monkeypatch):
+    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "jg_test_api_key")
+    monkeypatch.setattr(
+        MODULE.aiohttp,
+        "ClientSession",
+        lambda **kwargs: FakeSession(error=asyncio.TimeoutError()),
+    )
+    sleep = AsyncMock()
+    client = JungleGridClient(read_retries=2, retry_delay_seconds=0, sleep=sleep)
+    with pytest.raises(JungleGridError, match="timed out"):
         await client.get_job("job_123")
+    assert sleep.await_count == 2
+    sleep.reset_mock()
+    with pytest.raises(JungleGridError, match="timed out"):
+        await client.submit_job(workload())
+    sleep.assert_not_awaited()


 @pytest.mark.asyncio
-async def test_network_timeout_is_sanitized(monkeypatch):
-    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key")
+async def test_malformed_json_response_is_handled(monkeypatch):
+    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "jg_test_api_key")
     monkeypatch.setattr(
         MODULE.aiohttp,
         "ClientSession",
-        lambda **kwargs: FakeSession(error=asyncio.TimeoutError()),
+        lambda **kwargs: FakeSession(FakeResponse(200, "not-json")),
     )
-    client = JungleGridClient()
-
-    with pytest.raises(JungleGridError, match="timed out"):
-        await client.get_job("job_123")
+    with pytest.raises(JungleGridError, match="invalid JSON"):
+        await JungleGridClient(read_retries=0).get_job("job_123")


 @pytest.mark.asyncio
-async def test_api_error_is_sanitized(monkeypatch):
-    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key")
+async def test_api_error_code_and_message_are_sanitized(monkeypatch):
+    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "jg_test_api_key")
     body = json.dumps(
         {
             "error": {
                 "code":
"provider_jg_private_backend",
-                "message": "Bearer test-api-key is not allowed",
+                "message": "Bearer jg_test_api_key is forbidden",
             }
         }
     )
-    monkeypatch.setattr(MODULE.aiohttp, "ClientSession", lambda **kwargs: FakeSession(FakeResponse(403, body)))
-    client = JungleGridClient()
-
+    monkeypatch.setattr(
+        MODULE.aiohttp,
+        "ClientSession",
+        lambda **kwargs: FakeSession(FakeResponse(403, body)),
+    )
     with pytest.raises(JungleGridError) as exc_info:
-        await client.get_job("job_123")
+        await JungleGridClient().get_job("job_123")
     assert "jg_private_backend" not in exc_info.value.code
-    assert "[REDACTED]" in exc_info.value.code
-    assert "test-api-key" not in str(exc_info.value)
-
+    assert "jg_test_api_key" not in str(exc_info.value)

-def test_client_uses_documented_rest_api_environment(monkeypatch):
-    monkeypatch.setenv("JUNGLE_GRID_API", "https://orchestrator.example.test/")
-    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key")
+def test_client_prefers_official_api_base_and_normalizes_slashes(monkeypatch):
+    monkeypatch.setenv("JUNGLEGRID_API_BASE", "https://official.example.test///")
+    monkeypatch.setenv("JUNGLE_GRID_API_URL", "https://legacy.example.test")
     client = JungleGridClient()
+    assert client.api_base == "https://official.example.test"
+

-    assert client.api_base == "https://orchestrator.example.test"
+def test_client_keeps_legacy_api_base_fallback(monkeypatch):
+    monkeypatch.delenv("JUNGLEGRID_API_BASE", raising=False)
+    monkeypatch.setenv("JUNGLE_GRID_API_URL", "https://legacy.example.test/")
+    assert JungleGridClient().api_base == "https://legacy.example.test"


 @pytest.mark.asyncio
-async def test_client_uses_documented_runtime_and_log_routes(monkeypatch):
-    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "test-api-key")
+async def test_client_uses_current_routes_and_log_pagination(monkeypatch):
+    monkeypatch.setenv("JUNGLE_GRID_API_KEY", "jg_test_api_key")
     client = JungleGridClient()
     client._request = AsyncMock(return_value={})
+    await
client.estimate_job({})
+    await client.submit_job({})
+    await client.get_job("job 123")
+    await client.get_job_events("job 123")
+    await client.get_job_logs("job 123", limit=50, cursor="cursor-1")
+    await client.get_job_runtime("job 123")
+    await client.list_artifacts("job 123")
+    await client.get_artifact("job 123", "artifact 1")
+    await client.cancel_job("job 123", "reason")
+    paths = [call.args[1] for call in client._request.await_args_list]
+    assert paths == [
+        "/v1/mcp/jobs/estimate",
+        "/v1/mcp/jobs",
+        "/v1/mcp/jobs/job%20123",
+        "/v1/jobs/job%20123/events",
+        "/v1/mcp/jobs/job%20123/logs?limit=50&cursor=cursor-1",
+        "/v1/jobs/job%20123/runtime",
+        "/v1/mcp/jobs/job%20123/artifacts",
+        "/v1/mcp/jobs/job%20123/artifacts/artifact%201/download",
+        "/v1/mcp/jobs/job%20123/cancel",
+    ]
+
+
+def test_execution_state_never_persists_secret_values():
+    execution = ProjectExecution(
+        "project-1",
+        workload(environment_from_env={"TOKEN": "LOCAL_TOKEN"}),
+        "estimate-1",
+        {"available": True},
+        secret_values=["resolved-secret"],
+    )
+    assert "resolved-secret" not in json.dumps(execution.persisted())
+    assert execution.persisted()["workload"]["environment_from_env"] == {"TOKEN": "LOCAL_TOKEN"}

-    await client.get_job_runtime("job_123")
-    await client.get_job_logs("job_123")
-    assert client._request.await_args_list[0].args == ("GET", "/v1/jobs/job_123/runtime")
-    assert client._request.await_args_list[1].args == ("GET", "/v1/jobs/job_123/logs?tail=100")
+def test_state_artifact_name_is_stable():
+    assert STATE_ARTIFACT == "jungle_grid_execution_state"
+
+
+def test_redact_sensitive_handles_bearer_and_jungle_grid_keys():
+    text = redact_sensitive("Bearer abc and jg_super_secret")
+    assert "abc" not in text
+    assert "jg_super_secret" not in text

From 7d0bea42d3a43a162eca50cfc42e337ff99358f2 Mon Sep 17 00:00:00 2001
From: dejaguarkyng
Date: Thu, 11 Jun 2026 15:18:19 +0000
Subject: [PATCH 5/5] docs: update jungle grid production demo

---
 .../IMPLEMENTATION_DECISION.md |
153 +++++----
 .../09_jungle_grid_gpu_execution/README.md | 317 +++++++++++-------
 2 files changed, 284 insertions(+), 186 deletions(-)

diff --git a/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md b/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md
index 05857178a..64eba8a83 100644
--- a/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md
+++ b/sdk/demos/09_jungle_grid_gpu_execution/IMPLEMENTATION_DECISION.md
@@ -2,64 +2,97 @@

 ## Selected Extension Point

-This contribution is a runnable demo network with a Python `WorkerAgent`. The agent
-uses OpenAgents' project mod for the long-running workflow, project messages for
-estimate and lifecycle updates, and project artifacts for logs and Jungle Grid
-artifact metadata.
-
-Jungle Grid is an external agentic AI workload execution and GPU orchestration
-layer, not an OpenAgents transport, launcher agent type, or network mod. A demo
-keeps the integration provider-specific while showing a reusable OpenAgents
-pattern: an agent delegates asynchronous compute, waits for human approval
-before billable work, and returns results to a shared project.
-
-## Rejected Alternatives
-
-- **Launcher agent type:** Jungle Grid executes workloads; it is not an interactive
-  coding-agent runtime managed by the launcher.
-- **Core provider integration:** No OpenAgents core abstraction requires a
-  provider-specific compute backend.
-- **Jungle Grid mod:** The integration does not add network-wide event semantics or
-  shared infrastructure. Existing project events already cover the workflow.
-- **Hosted MCP entry:** Jungle Grid's hosted Streamable HTTP endpoint uses OAuth,
-  while local stdio uses an API key. The direct REST integration keeps approval
-  and project state inside OpenAgents without requiring an MCP auth change.
-- **Local stdio MCP dependency:** The Jungle Grid stdio MCP package is supported,
-  but a direct Python API client is easier to validate, test, and constrain around
-  mandatory human approval. It also avoids requiring Node.js for a Python demo.
-
-## Jungle Grid Contract Used
-
-The demo uses the documented public execution API:
-
-- `POST /v1/jobs/estimate`
-- `POST /v1/jobs`
-- `GET /v1/jobs/{job_id}`
+This contribution remains a runnable demo network with a deterministic Python
+`WorkerAgent`. It uses OpenAgents projects for assignment and lifecycle,
+project messages for human approval and meaningful status changes, and project
+artifacts for durable execution state and sanitized results.
+
+Jungle Grid is an external workload execution service, not an OpenAgents
+transport, launcher, credential type, or network mod. Keeping it as a demo makes
+the approval boundary and asynchronous project behavior explicit and testable.
+The agent calls REST directly because an MCP tool call would otherwise hide the
+project-state transition around billable submission.
+
+## Jungle Grid Contract
+
+The implementation was aligned against `Jungle-Grid/mcp-server` and the current
+orchestrator API implementation, not only the README:
+
+- `POST /v1/mcp/jobs/estimate`
+- `POST /v1/mcp/jobs`
+- `GET /v1/mcp/jobs/{job_id}`
+- `GET /v1/jobs/{job_id}/events`
+- `GET /v1/mcp/jobs/{job_id}/logs`
 - `GET /v1/jobs/{job_id}/runtime`
-- `GET /v1/jobs/{job_id}/logs`
-- `POST /v1/jobs/{job_id}/cancel`
-- `GET /v1/jobs/{job_id}/artifacts`
-- `POST /v1/jobs/{job_id}/artifacts/{artifact_id}/download`
-
-Authentication is a scoped server-side API key in `JUNGLE_GRID_API_KEY`; the
-REST base can be overridden with `JUNGLE_GRID_API`. The
-documented lifecycle includes `pending`, `queued`, `assigned`, `running`,
-`completed`, `failed`, `rejected`, and `cancelled`.
-
-The current REST request shape includes `model_size_gb`.
Estimate responses
-describe classification, routing, capacity, rates, cost ranges, queue waits,
-start windows, warnings, and screening without starting compute. Managed
-workloads can publish regular files from `/workspace/artifacts`; temporary
-signed artifact download URLs are treated as secrets and are not stored in the
-OpenAgents project.
-
-Workload environment values are not accepted in project goals. A goal may use
-`environment_from_env` to reference variables available only in the executor
-process; those values are resolved after human approval and are excluded from
-the estimate request and project-visible output.
-
-## Contribution Workflow
-
-OpenAgents' contributing guide asks contributors to create an issue for feature
-suggestions before submitting a pull request. This demo should be proposed in an
-issue and held for maintainer direction before a PR is opened.
+- `POST /v1/mcp/jobs/{job_id}/cancel`
+- `GET /v1/mcp/jobs/{job_id}/artifacts`
+- `POST /v1/mcp/jobs/{job_id}/artifacts/{artifact_id}/download`
+
+The official API-base override is `JUNGLEGRID_API_BASE`.
+`JUNGLE_GRID_API_URL` and the older demo variable `JUNGLE_GRID_API` remain
+compatibility fallbacks. Trailing slashes are removed.
+
+The public workload types are `inference`, `training`, `fine_tuning`, and
+`batch`; `fine_tuning` is sent to REST as `fine-tuning`. The preferred command
+shape is an array. Legacy string `command` plus string-array `args` is combined
+in order before estimation and submission.
+
+## Uploaded Files
+
+The demo accepts previously uploaded Jungle Grid `input_id` values through
+`input_files` and `script_files`. This is the minimum safe file workflow:
+
+- IDs are validated locally and then verified by Jungle Grid during estimate or
+  submission.
+- No goal field can name an executor host path.
+- Upload URLs, completion tokens, and storage credentials never enter project
+  state.
+
+Uploading OpenAgents artifacts would require a separate authorization and
+byte-transfer design. It is intentionally outside this demo rather than
+allowing a project goal to read arbitrary local files.
+
+## Durable Idempotency
+
+`jungle_grid_execution_state` records the estimate ID, submission state,
+recorded job ID, cancellation state, status fingerprint, event IDs, and log
+cursor. The agent writes `submitting` before the non-idempotent submission call
+and writes the returned job ID immediately afterward.
+
+After restart:
+
+- a recorded job resumes monitoring;
+- a terminal project is not resubmitted;
+- a `submitting` state without a recorded job is not retried automatically,
+  because the current submission contract does not expose a verified
+  idempotency key;
+- duplicate approvals and cancellations are serialized by a per-project lock.
+
+This favors avoiding a duplicate billable job over guessing after an ambiguous
+network failure.
+
+## Security Decisions
+
+- Estimation cannot submit compute.
+- Submission requires exact `APPROVE ` from a `human:` identity.
+- Cancellation requires exact `CANCEL ` from a `human:` identity.
+- API and workload secrets are resolved from environment variables only.
+- Callback auth uses `callback.auth_token_from_env`; literal callback secrets
+  are not accepted.
+- Metadata with secret-like keys, Bearer tokens, API-key patterns, and signed
+  URLs are rejected or redacted.
+- Artifact download URLs are not requested during finalization. The client
+  method exists to match the API, but project state stores metadata only.
+- Automated tests mock all external calls.
+
+The committed `executors.password_hash` is a demo-only group credential. Its
+purpose is to establish actual runtime topology membership so project
+notifications reach the executor. It must be replaced for a shared deployment.
+
+## Deliberately Unsupported Goal Fields
+
+The current public MCP submission contract does not expose arbitrary
+host-file paths, CPU or memory sizing, provider pinning, or user-controlled
+retry policy. The demo does not invent those fields. It supports the verified
+GPU, region, priority, timeout, callback, routing, upload-reference, template,
+metadata, and expected-artifact fields accepted by the current API.
diff --git a/sdk/demos/09_jungle_grid_gpu_execution/README.md b/sdk/demos/09_jungle_grid_gpu_execution/README.md
index 599cf77ab..f5df32ad9 100644
--- a/sdk/demos/09_jungle_grid_gpu_execution/README.md
+++ b/sdk/demos/09_jungle_grid_gpu_execution/README.md
@@ -1,202 +1,267 @@
 # Jungle Grid GPU Execution Demo

-This demo shows an OpenAgents execution agent delegating long-running AI and GPU
-workloads to [Jungle Grid](https://junglegrid.dev), an agentic AI workload
-execution and GPU orchestration layer that classifies intent, resolves capacity,
-and places workloads without requiring agents to manage GPU servers.
+This demo delegates asynchronous GPU workloads from an OpenAgents project to
+[Jungle Grid](https://junglegrid.dev). A deterministic Python `WorkerAgent`
+estimates first, waits for exact human approval, submits once, then polls
+lifecycle events, status, logs, runtime details, and managed artifact metadata.

-The workflow fits OpenAgents because the workload is asynchronous and
-collaborative: an agent estimates the job, a human approves spending in the
-shared project, and the agent returns lifecycle updates, logs, and artifact
-metadata to the same workspace.
+```text
+Project goal
+→ estimate
+→ human approval
+→ optional input/script references
+→ submit
+→ lifecycle events and status
+→ workload logs
+→ runtime details
+→ managed artifacts
+```
+
+The demo calls REST directly so the human approval boundary and durable
+OpenAgents project state remain explicit and testable. It does not require an
+LLM or an MCP runtime dependency.
## Security And Billing Warning

-Jungle Grid jobs may consume credits or incur charges. The executor never submits
-a workload when a project starts. It requires an exact approval command from a
-human identity after posting the estimate. Keep API keys in environment variables
-and do not paste secrets into project goals, messages, logs, metadata, or
-committed files. Workloads that need environment values must use
-`environment_from_env`; the executor resolves those references only after human
-approval, immediately before submission.
+Jungle Grid jobs may consume credits or incur charges. Project creation only
+estimates. Billable submission requires this exact command from a verified
+human identity:

-## Prerequisites
+```text
+APPROVE
+```

-- Python with the OpenAgents development package installed.
-- A Jungle Grid account and a scoped API key that can estimate, submit, read, and
-  cancel jobs.
-- A public container image suitable for the requested workload.
+Cancellation also requires an exact human command:

-## Environment Variables
+```text
+CANCEL
+```

-- `JUNGLE_GRID_API_KEY` is required. The agent reads this server-side API key and
-  sends it only as a Bearer token to Jungle Grid.
-- `JUNGLE_GRID_API` optionally overrides the default REST API base,
-  `https://api.junglegrid.dev`.
-- Any workload-specific variables referenced by `environment_from_env` must also
-  be exported in the executor process. Their values are never placed in the
-  project goal or estimate request.
+Keep credentials in executor environment variables. Do not put secrets in
+goals, messages, metadata, logs, or committed files. The demo rejects literal
+API-key/Bearer patterns, resolves workload secrets only after approval, redacts
+shared output, never reads arbitrary host paths, and never stores temporary
+signed artifact URLs.

-## Setup
+## Prerequisites
+
+- OpenAgents development dependencies.
+- A scoped Jungle Grid API key with estimate, submit, read, logs, artifact, and
+  cancellation access.
+- A GPU-capable public container image or configured private-image credential.
+- Previously uploaded Jungle Grid input IDs for file-backed jobs.

-From the repository root, install OpenAgents with SDK and development
-dependencies so the network, agent, and test commands are available:
+Install the repository package and development tools:

 ```bash
 pip install -e ".[sdk,dev]"
 ```

-Export the Jungle Grid API key in the shell that will run the executor. This
-keeps the credential out of the repository and network configuration:
+## Environment Configuration

 ```bash
 export JUNGLE_GRID_API_KEY="jg_..."
+export JUNGLEGRID_API_BASE="https://api.junglegrid.dev"
+export JUNGLE_GRID_POLL_INTERVAL_SECONDS="10"
+export JUNGLE_GRID_MAX_POLL_FAILURES="3"
 ```

-## Run The Demo
-
-The current demo assumes exactly one executor. Run one
-`jungle-grid-executor` process so a project is estimated and submitted at most
-once.
+`JUNGLEGRID_API_BASE` is the current official API-base override.
+`JUNGLE_GRID_API_URL` and `JUNGLE_GRID_API` are compatibility fallbacks. The
+executor removes trailing slashes. Workload variables referenced by
+`environment_from_env` must be exported in the executor process.

-Start the OpenAgents network from this demo directory. The network enables the
-project mod and exposes the `Jungle Grid GPU Execution` project template:
+## Start The Network

 ```bash
 cd sdk/demos/09_jungle_grid_gpu_execution
 openagents network start network.yaml
 ```

-In a second terminal, start the deterministic Python executor. It does not need
-an LLM provider key:
+The network enables the project mod and restricts the template to the
+`executors` group. The committed group password hash is a demo-only credential;
+replace it before a shared deployment.
+
+## Start The Executor

 ```bash
 cd sdk/demos/09_jungle_grid_gpu_execution
 python agents/jungle_grid_executor.py
 ```

-The script connects with the password hash configured for the `executors`
-group. OpenAgents records that connection in
-`network.topology.agent_group_membership`, which is the runtime source used by
-the project mod. The optional `metadata.agents` list in an agent-group
-configuration does not assign runtime membership and is intentionally not used
-by this demo.
+The executor supplies the configured group password hash during
+`async_start`. OpenAgents therefore records it in
+`network.topology.agent_group_membership`; static metadata alone does not
+establish group membership. Run one executor for this demo.
+
+## Create A Project
+
+Open Studio at `http://localhost:8700/studio`, choose
+`Jungle Grid GPU Execution`, and provide a JSON goal.
+
+### Simple Command Job

-Open Studio at `http://localhost:8700/studio`, create a project with the
-`Jungle Grid GPU Execution` template, and use a JSON object as the project goal.
-For example:
+The preferred command representation is an array:

 ```json
 {
-  "name": "openagents-batch-demo",
-  "workload_type": "batch",
-  "image": "python:3.11-slim",
+  "name": "openagents-training-demo",
+  "workload_type": "training",
+  "image": "pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime",
+  "command": ["python", "-c", "import torch; print(torch.cuda.is_available())"],
   "model_size_gb": 1,
+  "gpu_required": true,
+  "routing_mode": "cost"
+}
+```
+
+The original format remains compatible and is converted without reordering:
+
+```json
+{
+  "name": "legacy-command-demo",
+  "workload_type": "batch",
+  "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
   "command": "python",
-  "args": ["-c", "print('hello from Jungle Grid')"],
-  "optimize_for": "cost"
+  "args": ["-c", "print('hello')"]
+}
+```
+
+Accepted workload types are `inference`, `training`, `fine_tuning`, and
+`batch`.
+
+### File-Backed Job
+
+Upload files through Jungle Grid first, then use only the returned IDs:
+
+```json
+{
+  "name": "openagents-transcription",
+  "workload_type": "inference",
+  "image": "ghcr.io/example/whisper-runtime:cuda",
+  "command": [
+    "python",
+    "/workspace/scripts/transcribe.py",
+    "/workspace/inputs/audio.wav",
+    "/workspace/artifacts/transcript.txt"
+  ],
+  "script_files": [{"input_id": "inp_script123"}],
+  "input_files": [{"input_id": "inp_audio123"}],
+  "expected_artifacts": ["/workspace/artifacts/transcript.txt"]
 }
 ```

-The agent validates the request and calls the read-only
-`POST /v1/jobs/estimate` endpoint. Current estimates include workload
-classification, routing and capacity signals, hourly and total cost ranges,
-queue-wait ranges, estimated start windows, warnings, and screening details.
-The executor posts that structured estimate and stores it as project artifact
-`jungle_grid_estimate`. No compute has been submitted at this point.
+Inputs mount under `/workspace/inputs`, scripts under `/workspace/scripts`, and
+managed outputs belong under `/workspace/artifacts`. `local_path` and similar
+host-file fields are not supported.

-For a workload that needs a credential or other environment value, export it in
-the executor shell and reference only its local variable name in the goal:
+### Environment And Callback Secrets

 ```bash
 export MODEL_TOKEN="..."
+export CALLBACK_TOKEN="..."
```

 ```json
 {
-  "name": "openagents-inference-demo",
+  "name": "secure-inference",
   "workload_type": "inference",
-  "image": "example/model-server:latest",
-  "model_size_gb": 7,
-  "environment_from_env": {
-    "MODEL_TOKEN": "MODEL_TOKEN"
-  },
-  "optimize_for": "cost"
+  "image": "ghcr.io/example/model-runtime:cuda",
+  "environment_from_env": {"MODEL_TOKEN": "MODEL_TOKEN"},
+  "callback": {
+    "url": "https://example.com/hooks/jungle",
+    "metadata": {"source": "openagents"},
+    "auth_token_from_env": "CALLBACK_TOKEN"
+  }
 }
 ```

-The mapping key is the variable sent to the workload, and the mapping value is
-the local executor variable to resolve. Literal `environment` values, API keys,
-Bearer tokens, and secret-like metadata keys are rejected.
+Environment and callback token values are absent from estimates and are
+resolved only after approval.

-Review the estimate, then reply in the project with the exact command shown by
-the agent. Estimates that explicitly report `available: false` or
-`can_submit: false` cannot be approved:
+## Estimate And Approval

-```text
-APPROVE
-```
+The executor calls `POST /v1/mcp/jobs/estimate`, stores a sanitized structured
+response in `jungle_grid_estimate`, and posts a short summary. It respects
+`screening.can_submit`, availability, warnings, fixes, blocked checks, routing,
+cost/rate ranges, duration, queue/start windows, and capacity fields returned by
+the API.

-After approval, the agent submits with `POST /v1/jobs`, polls
-`GET /v1/jobs/{job_id}`, and posts public lifecycle changes: pending, queued,
-assigned, running, completed, failed, rejected, or cancelled. On a terminal
-state it retrieves the runtime surface, the latest 100 stored log entries, and
-the managed artifact list. Regular files written by managed workloads under
-`/workspace/artifacts` are eligible for automatic upload.
+`screening.can_submit: true` does not prove immediate capacity.
+`capacity_status.immediate_capacity_confirmed` is the relevant signal.
Approval
+is blocked when screening or availability explicitly rejects submission.

-Artifact download requests mint temporary signed URLs. The executor requests
-download metadata but redacts the URL before storing `jungle_grid_result`; do
-not log or share signed URLs.
+## Monitoring

-To cancel a submitted job, reply with the exact job ID:
+After approval the executor:

-```text
-CANCEL
-```
+- polls `GET /v1/mcp/jobs/{job_id}` for status, execution phase, status message,
+  phase timing, delayed-start, scheduling, retry, failure, and completion data;
+- polls `GET /v1/jobs/{job_id}/events` separately for platform lifecycle events;
+- polls paginated `GET /v1/mcp/jobs/{job_id}/logs`;
+- reads `GET /v1/jobs/{job_id}/runtime` at finalization;
+- lists managed artifacts after terminal status.

-Cancellation is explicit and only applies when the job ID matches the project.
-Only a human identity can request cancellation. The agent reports cancellation
-failures without exposing the API key.
+Lifecycle names are not restricted to a local enum. Event IDs and log cursors
+prevent duplicates. Messages are posted only for meaningful state changes.
+Empty workload logs during scheduling, provisioning, input preparation, or
+container startup do not fail the project. This is polling, not true streaming.

-## Failure Behavior
+Shared event and log history is bounded to 200 entries each. API keys, Bearer
+tokens, resolved environment values, authorization fields, and signed URLs are
+redacted.

-Invalid workload JSON, missing required fields, missing API keys, timeouts,
-invalid Jungle Grid responses, and API errors are posted to the project in
-sanitized form. Failed, rejected, or cancelled jobs stop the OpenAgents project.
-Completed jobs complete the project.
+## Artifacts

-The API key needs `jobs:estimate`, `jobs:submit`, `jobs:read`, and `logs:read`
-capabilities for the complete flow.
+Regular files written under `/workspace/artifacts` are eligible for managed
+collection.
`jungle_grid_result` contains sanitized job data, bounded lifecycle
+events, bounded logs, runtime details when available, and artifact IDs, names,
+paths, sizes, and content types returned by Jungle Grid.

-## Jungle Grid Interfaces
+The API can mint temporary artifact download URLs, but this demo intentionally
+does not request or store them. Downloading bytes into an OpenAgents artifact
+would require a separate size, authorization, and content-handling policy.

-This demo calls the REST API directly so OpenAgents can enforce project-based
-human approval. Jungle Grid also provides the `jungle` CLI, whose `submit`
-command estimates and asks for confirmation before queuing, and a hosted MCP
-endpoint at `https://mcp.junglegrid.dev/mcp`. Hosted MCP uses OAuth; local stdio
-MCP uses `JUNGLE_GRID_API_KEY`. The current MCP tools are `estimate_job`,
-`submit_job`, `list_jobs`, `get_job`, `get_job_logs`, `cancel_job`,
-`list_artifacts`, and `get_artifact`.
+## Cancellation And Failure

-## Tests
+Cancellation is accepted only for the job ID already recorded for that project.
+Unauthorized, mismatched, duplicate, and terminal-state cancellation requests
+do not call Jungle Grid.

-Run the focused mocked tests. They do not contact Jungle Grid or submit paid
-work:
+Safe GET requests use bounded retries with exponential backoff. Submission is
+never automatically retried because the current contract does not expose a
+verified idempotency mechanism. If the executor restarts after recording a job,
+it resumes monitoring. If it restarts with an uncertain `submitting` state and
+no job ID, it refuses to resubmit blindly.

-```bash
-pytest tests/agents/test_jungle_grid_executor.py
-```
+Completed jobs complete the OpenAgents project. Failed, rejected, and cancelled
+jobs stop it. Runtime details may be unavailable before assignment/startup and
+do not by themselves fail finalization.
-Run the repository formatter and linter checks used by the Python project:
+## Current Jungle Grid MCP Tools
+
+The current registry exposes:
+
+- `estimate_job`
+- `submit_job`
+- `upload_job_input`
+- `list_job_inputs`
+- `list_jobs`
+- `get_job`
+- `get_job_events`
+- `get_job_logs`
+- `cancel_job`
+- `list_artifacts`
+- `get_artifact`
+
+## Tests
+
+All external requests are mocked. Tests never require a Jungle Grid account,
+contact the live API, or submit paid work:

 ```bash
-ruff format --check sdk/demos/09_jungle_grid_gpu_execution tests/agents/test_jungle_grid_executor.py
+pytest tests/agents/test_jungle_grid_executor.py -q
 ruff check sdk/demos/09_jungle_grid_gpu_execution tests/agents/test_jungle_grid_executor.py
+ruff format --check sdk/demos/09_jungle_grid_gpu_execution tests/agents/test_jungle_grid_executor.py
+mypy --follow-untyped-imports sdk/demos/09_jungle_grid_gpu_execution/agents/jungle_grid_executor.py
 ```
-
-## Optional Live Estimate
-
-The normal demo performs a live estimate when a project starts, but it never
-automatically submits a job. Use a low-cost workload goal, review the estimate in
-the project, and do not send the approval command unless you explicitly intend
-to start billable compute.
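
The API-base behavior this series documents and tests (prefer `JUNGLEGRID_API_BASE`, fall back to the older variables, strip trailing slashes) can be sketched as below. This is a minimal illustration, not the code in `jungle_grid_executor.py`: the helper name `resolve_api_base`, the constant names, and the relative order of the two fallback variables are assumptions; only the variable names, the default URL, and the expected results come from the patch.

```python
import os
from typing import Mapping, Optional

# Default base URL as shown in the README's environment example.
DEFAULT_API_BASE = "https://api.junglegrid.dev"

# Assumed precedence: official override first, then the two compatibility
# fallbacks. The order between the fallbacks is a guess, not from the patch.
API_BASE_VARIABLES = ("JUNGLEGRID_API_BASE", "JUNGLE_GRID_API_URL", "JUNGLE_GRID_API")


def resolve_api_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first configured API base with trailing slashes removed.

    Hypothetical helper for illustration; falls back to DEFAULT_API_BASE
    when none of the variables is set to a non-empty value.
    """
    source = os.environ if env is None else env
    for name in API_BASE_VARIABLES:
        value = source.get(name, "").strip()
        if value:
            return value.rstrip("/")
    return DEFAULT_API_BASE
```

Passing the environment as a mapping keeps the sketch testable without `monkeypatch`; the patch's own tests exercise the same cases through `JungleGridClient().api_base`.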