feat: Enhance press-agent communication#6446
Conversation
|
| Filename | Overview |
|---|---|
| press/agent.py | Adds Ed25519 token verification methods; length check, DoesNotExistError guard, and server-identity claim are correctly implemented. Expiry check retains a 60 s post-expiry grace window (flagged in previous review). |
| press/api/agent_auth.py | Thin helper that extracts X-Agent-Token, instantiates Agent, and delegates to extract_and_verify_token; logic is straightforward and correct. |
| press/api/callbacks.py | New update_job endpoint correctly enqueues processing, adds the server filter to prevent cross-server job manipulation, and checks for missing job docs; all issues from prior review rounds appear addressed. |
| press/press/doctype/agent_auth/agent_auth.py | Key rotation logic is mostly correct after previous fixes (reload inside lock, 600 s TTL cache); however regenerate_public_key is never cleared by the rotation flow itself, creating an edge-case where future rotations may be silently skipped on dormant servers. |
| press/press/doctype/server/server.py | Key generation, signing, and initial setup look correct; _setup_agent_auth early-return guard and proper auth.save() after Ansible success address prior concerns. sign_agent_token uses a timezone-stripped naive datetime for timestamp(), which gives a wrong exp claim on non-UTC hosts. |
| press/press/doctype/agent_job/agent_job.py | Undelivered-jobs retry via Redis set is well-structured; srem is correctly placed in the else clause. Two new concerns: unconditional sadd regardless of push_feature, and a per-update DB query for all step docnames in publish_update. |
| press/hooks.py | Correctly registers the daily regenerate_token scheduler and the per-minute retry_poll scheduler. |
Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 4
press/press/doctype/agent_auth/agent_auth.py:42-43
**`regenerate_public_key` not cleared after successful rotation**
`_regenerate_token` sets `regenerate_public_key` in the DB and relies on `get_regenerate_public_key()` (called on every agent request) to clear it once the Redis cache expires after 600 s. The rotation flow itself never clears the field. If a server goes quiet for more than 600 s after the cache TTL expires — and then the daily scheduler fires the next pre-expiry rotation — `self.reload()` on line 39 will find `regenerate_public_key` still populated and return early, silently skipping the rotation. A dormant-but-still-registered server could end up with an expired token and no automatic way to recover. Adding a DB clear of `regenerate_public_key` at the end of a successful rotation (or inside `_setup_agent_auth` on success) would close this gap.
### Issue 2 of 4
press/press/doctype/agent_job/agent_job.py:197
**`sadd` called unconditionally regardless of the `push_feature` flag**
`frappe.cache().sadd("undelivered_jobs", ...)` fires on every callback delivery failure, even when `push_feature` is disabled and `retry_poll` is a no-op. In poll-only deployments, the set grows without bound (one entry per unique server with any failure), and when `push_feature` is eventually enabled, `retry_poll` will immediately process the entire accumulated backlog in a single scheduler tick. Guard the `sadd` with the same flag check used in `retry_poll` to avoid this.
### Issue 3 of 4
press/press/doctype/agent_job/agent_job.py:458-467
**Extra DB query per `publish_update` call scales with step count**
`frappe.get_all("Agent Job Step", ...)` is now executed on every `publish_update` invocation. `publish_update` is called from `process_job_updates` on each polled or pushed status change, so a job with N steps triggers N+1 additional socket publishes and one extra DB round-trip on every update cycle. For jobs with dozens of steps updating at high frequency this adds measurable overhead. Consider caching the step names when the job is first processed, or limit this realtime push to a single `list_update` event that the client can use to re-fetch rather than pushing per-step `doc_update` events.
### Issue 4 of 4
press/press/doctype/server/server.py:1938-1945
**`expires_in.timestamp()` on a timezone-stripped naive datetime**
`datetime.datetime.now(datetime.timezone.utc)` yields a UTC-aware datetime. After `.replace(tzinfo=None)` it becomes a naive datetime. Calling `.timestamp()` on a naive datetime interprets it as local time, so on a server in a non-UTC timezone the `exp` claim in the JWT will be offset by the UTC delta — the token effectively expires earlier or later than intended. Remove the `.replace(tzinfo=None)` from the `expires_in` assignment so the aware datetime is converted correctly; strip the timezone only when writing to the Frappe `Datetime` field.
```suggestion
expires_in_aware = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=90)
# Strip tzinfo only for the Frappe Datetime field (which is stored as naive UTC)
expires_in = expires_in_aware.replace(tzinfo=None)
payload = {
"server": self.name,
"exp": int(expires_in_aware.timestamp()), # 3 month
}
```
Reviews (9): Last reviewed commit: "fix(agent): Throw permission error if ve..." | Re-trigger Greptile
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (69.52%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #6446 +/- ##
===========================================
+ Coverage 49.76% 49.85% +0.08%
===========================================
Files 955 958 +3
Lines 78917 79285 +368
Branches 361 360 -1
===========================================
+ Hits 39272 39526 +254
- Misses 39621 39735 +114
Partials 24 24
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| if not server: | ||
| frappe.throw("Not permitted", frappe.ValidationError) | ||
|
|
||
| verify_agent(server) |
There was a problem hiding this comment.
Parse the server name from sub of jwt token and use it further. Better to not accept server param from request.
|
|
||
| @frappe.whitelist(allow_guest=True) | ||
| @rate_limit(limit=500, seconds=60) | ||
| def update_job(job, server): |
| frappe.enqueue( | ||
| handle_polled_job, | ||
| queue="short", | ||
| polled_job=job, | ||
| job=job_doc, |
There was a problem hiding this comment.
Run the job in request instead of enqueue, so that agent know whether to retry.
In case of failure, rollback changes and then increase callback_failure_count (Check handle_polled_job if it's already handled)
If, callback failure count already crossed, give agent succesful status and mark the job as failure on press.
| def targets(server: str, token: str | None = None): | ||
| verify_agent(server) |
There was a problem hiding this comment.
We can leave it for now, it has it's own token based auth
|
|
||
| @frappe.whitelist(allow_guest=True) | ||
| def benches_are_idle(server: str, access_token: str) -> None: | ||
| def benches_are_idle(server: str) -> None: |
There was a problem hiding this comment.
Better to not accept server parameter. Get it from jwt token sub param
| - role: mariadb_memory_allocator | ||
| - role: nginx | ||
| - role: agent | ||
| - role: setup_agent_auth |
There was a problem hiding this comment.
Remove this and add the step in agent role
| - role: user | ||
| - role: nginx | ||
| - role: agent | ||
| - role: setup_agent_auth |
| - role: user | ||
| - role: nginx | ||
| - role: agent | ||
| - role: setup_agent_auth |
| from press.press.doctype.server.server import BaseServer | ||
|
|
||
|
|
||
| class AgentAuth(Document): |
There was a problem hiding this comment.
Let's make it stateless instead of tracking in a different doctype.
Press can issue token with HS256 and store the last_issue and expiry of jwt token.
Agent can ask to refresh the token beforehand. But, keep something on press to reissue and set the token manually (ansible play).
| "default": 0, | ||
| "fieldname": "push_feature", | ||
| "fieldtype": "Check", | ||
| "label": "Push Feature" |
There was a problem hiding this comment.
Give it a better name to know it's related to agent job
Related PR
Agent