feat: Enhance press-agent communication by 20vikash · Pull Request #6446 · frappe/press

20vikash · 2026-05-17T10:49:47Z

- Enhanced auth: The agent now uses a long-lived HS256 token that Press can verify for authenticated agent callbacks.
- Token regeneration: Tokens can be regenerated automatically without agent downtime.
- Poll -> Push architecture: Press no longer polls agents for job updates. Agents now push job updates to Press in real time.
- Retry support: The agent retries failed job update deliveries when Press is temporarily unreachable.
- Undelivered jobs recovery: Agents poll Press every 10 seconds to run undelivered jobs.
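
For reference, a long-lived HS256 token of the kind described above can be issued and checked with the standard library alone. This is a minimal sketch, not the code in this PR: `sign_token`, `verify_token`, the `sub`/`iat`/`exp` claim layout, and the shared `secret` are all assumptions made for illustration.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(secret: bytes, server: str, ttl_seconds: int, now: float | None = None) -> str:
    """Issue an HS256 token whose subject is the server name (hypothetical layout)."""
    issued_at = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(
        json.dumps({"sub": server, "iat": issued_at, "exp": issued_at + ttl_seconds}).encode()
    )
    signing_input = f"{header}.{payload}"
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(secret: bytes, token: str, now: float | None = None) -> dict:
    """Return the claims if the signature and expiry check out, else raise PermissionError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError:
        raise PermissionError("malformed token") from None
    if not isinstance(header, dict) or not isinstance(payload, dict):
        raise PermissionError("malformed token")
    if header.get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        raise PermissionError("bad signature")
    current = time.time() if now is None else now
    if payload.get("exp", 0) <= current:
        raise PermissionError("token expired")
    return payload
```

The `now` parameter exists only so the expiry check can be exercised deterministically in tests.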

Related PR
Agent

greptile-apps · 2026-05-17T10:55:20Z

Greptile Summary

This PR introduces a significant architectural overhaul of Press–Agent communication: polling is replaced with agent-pushed updates, and shared-secret auth is replaced with per-server Ed25519 tokens stored in a new AgentAuth doctype. Many issues identified in earlier review rounds have been addressed (server filter on update_job, enqueued processing, reload-inside-lock for rotation, DoesNotExistError guard, and correct Ansible status checks).

New auth layer: agents present a signed X-Agent-Token header verified via Ed25519 public keys cached in Redis; dual-key verification supports zero-downtime token rotation with a 600 s overlap window.
Push endpoints: new update_job endpoint with server-scoped job lookup; retry_poll scheduler reconciles any undelivered updates every minute using a Redis set.
Token lifecycle: a daily scheduler pre-rotates tokens within 7 days of expiry using AgentAuth._regenerate_token, running an Ansible playbook under a distributed lock.
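
The dual-key check during the rotation overlap could look roughly like the sketch below. `KeyState`, `accepts` and `hmac_verify` are hypothetical names, and HMAC-SHA256 is used as a stand-in for Ed25519 signature verification so the example runs on the standard library alone; only the 600 s window comes from the summary above.

```python
from __future__ import annotations

import hashlib
import hmac
import time
from dataclasses import dataclass
from typing import Callable

# verify(key, message, signature) -> True when the signature is valid for that key
Verifier = Callable[[bytes, bytes, bytes], bool]

OVERLAP_SECONDS = 600  # overlap window named in the summary


@dataclass
class KeyState:
    """Keys known for one server (hypothetical structure)."""

    current_key: bytes
    previous_key: bytes | None = None
    rotated_at: float | None = None  # when current_key replaced previous_key


def accepts(
    state: KeyState,
    message: bytes,
    signature: bytes,
    verify: Verifier,
    now: float | None = None,
) -> bool:
    """Accept the current key always, and the previous key only inside the overlap."""
    now = time.time() if now is None else now
    if verify(state.current_key, message, signature):
        return True
    in_overlap = (
        state.previous_key is not None
        and state.rotated_at is not None
        and now - state.rotated_at < OVERLAP_SECONDS
    )
    return in_overlap and verify(state.previous_key, message, signature)


def hmac_verify(key: bytes, message: bytes, signature: bytes) -> bool:
    """Stand-in verifier; a real deployment would verify an Ed25519 signature here."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)
```

Passing the verifier in as a callable keeps the overlap logic independent of the signature scheme.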

Confidence Score: 4/5

The core auth and push-update paths are functional, but a handful of edge cases in the rotation flow and realtime update logic warrant a closer look before merging.

The rotation mechanism leaves regenerate_public_key populated in the database after a successful rotation, relying solely on live agent traffic to clear it. On a dormant server this could block future automated rotations indefinitely. The exp claim is computed from a timezone-stripped naive datetime, which will be wrong on non-UTC hosts. The sadd for undelivered jobs fires unconditionally regardless of the feature flag, building up a backlog silently in poll mode.

press/press/doctype/agent_auth/agent_auth.py (rotation cleanup), press/press/doctype/server/server.py (timestamp computation), press/press/doctype/agent_job/agent_job.py (unconditional sadd and per-update DB query)

Important Files Changed

| Filename | Overview |
| --- | --- |
| press/agent.py | Adds Ed25519 token verification methods; length check, DoesNotExistError guard, and server-identity claim are correctly implemented. Expiry check retains a 60 s post-expiry grace window (flagged in previous review). |
| press/api/agent_auth.py | Thin helper that extracts X-Agent-Token, instantiates Agent, and delegates to extract_and_verify_token; logic is straightforward and correct. |
| press/api/callbacks.py | New update_job endpoint correctly enqueues processing, adds the server filter to prevent cross-server job manipulation, and checks for missing job docs; all issues from prior review rounds appear addressed. |
| press/press/doctype/agent_auth/agent_auth.py | Key rotation logic is mostly correct after previous fixes (reload inside lock, 600 s TTL cache); however regenerate_public_key is never cleared by the rotation flow itself, creating an edge-case where future rotations may be silently skipped on dormant servers. |
| press/press/doctype/server/server.py | Key generation, signing, and initial setup look correct; _setup_agent_auth early-return guard and proper auth.save() after Ansible success address prior concerns. sign_agent_token uses a timezone-stripped naive datetime for timestamp(), which gives a wrong exp claim on non-UTC hosts. |
| press/press/doctype/agent_job/agent_job.py | Undelivered-jobs retry via Redis set is well-structured; srem is correctly placed in the else clause. Two new concerns: unconditional sadd regardless of push_feature, and a per-update DB query for all step docnames in publish_update. |
| press/hooks.py | Correctly registers the daily regenerate_token scheduler and the per-minute retry_poll scheduler. |

Prompt To Fix All With AI

Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
press/press/doctype/agent_auth/agent_auth.py:42-43
**`regenerate_public_key` not cleared after successful rotation**

`_regenerate_token` sets `regenerate_public_key` in the DB and relies on `get_regenerate_public_key()` (called on every agent request) to clear it once the Redis cache expires after 600 s. The rotation flow itself never clears the field. If a server goes quiet for more than 600 s after the cache TTL expires — and then the daily scheduler fires the next pre-expiry rotation — `self.reload()` on line 39 will find `regenerate_public_key` still populated and return early, silently skipping the rotation. A dormant-but-still-registered server could end up with an expired token and no automatic way to recover. Adding a DB clear of `regenerate_public_key` at the end of a successful rotation (or inside `_setup_agent_auth` on success) would close this gap.
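
One possible shape for the suggested cleanup is sketched below. `AuthRecord` is an in-memory stand-in for the AgentAuth document and `deploy` stands in for the Ansible run; both are hypothetical, and how the overlap window is preserved (for example through the cache) is deliberately left out. Clearing the pending field on failure as well is an assumption, made so the next scheduled run can retry.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class AuthRecord:
    """In-memory stand-in for the AgentAuth document (hypothetical fields)."""

    public_key: str
    regenerate_public_key: str | None = None


def rotate(record: AuthRecord, new_public_key: str, deploy: Callable[[str], bool]) -> bool:
    """Rotate the key and always clear the pending field before returning."""
    if record.regenerate_public_key:
        # Mirrors the early return described above: a pending key blocks rotation.
        return False
    record.regenerate_public_key = new_public_key
    try:
        deployed = deploy(new_public_key)
        if deployed:
            record.public_key = new_public_key
        return deployed
    finally:
        # The rotation flow itself clears the field instead of relying on
        # later agent traffic to do it.
        record.regenerate_public_key = None
```

With the `finally` clause in place, a dormant server cannot leave `regenerate_public_key` populated, so the next scheduled rotation is not skipped.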

### Issue 2 of 4
press/press/doctype/agent_job/agent_job.py:197
**`sadd` called unconditionally regardless of the `push_feature` flag**

`frappe.cache().sadd("undelivered_jobs", ...)` fires on every callback delivery failure, even when `push_feature` is disabled and `retry_poll` is a no-op. In poll-only deployments, the set grows without bound (one entry per unique server with any failure), and when `push_feature` is eventually enabled, `retry_poll` will immediately process the entire accumulated backlog in a single scheduler tick. Guard the `sadd` with the same flag check used in `retry_poll` to avoid this.
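
A guard along these lines would keep the set empty in poll-only deployments. This is a sketch under assumptions: a plain Python `set` stands in for the Redis set, and `record_undelivered` and `push_enabled` are hypothetical names for the helper and the flag lookup.

```python
def record_undelivered(undelivered: set[str], server: str, push_enabled: bool) -> bool:
    """Remember a server with failed deliveries, but only when push mode is on.

    Returns True when the server was recorded and False when the call was skipped.
    """
    if not push_enabled:
        # In poll mode nothing ever drains the set, so nothing should be added.
        return False
    undelivered.add(server)
    return True
```

Usage, with hypothetical server names:

```python
pending: set[str] = set()
record_undelivered(pending, "f1.example.com", push_enabled=False)  # skipped
record_undelivered(pending, "f2.example.com", push_enabled=True)   # recorded
```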

### Issue 3 of 4
press/press/doctype/agent_job/agent_job.py:458-467
**Extra DB query per `publish_update` call scales with step count**

`frappe.get_all("Agent Job Step", ...)` is now executed on every `publish_update` invocation. `publish_update` is called from `process_job_updates` on each polled or pushed status change, so a job with N steps triggers N+1 additional socket publishes and one extra DB round-trip on every update cycle. For jobs with dozens of steps updating at high frequency this adds measurable overhead. Consider caching the step names when the job is first processed, or limit this realtime push to a single `list_update` event that the client can use to re-fetch rather than pushing per-step `doc_update` events.
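
The caching option could be as small as the sketch below. `StepNameCache` and its `fetch` callable are hypothetical; `fetch` stands in for the `frappe.get_all("Agent Job Step", ...)` lookup, and the sketch assumes the set of steps does not change once a job has been processed for the first time.

```python
from __future__ import annotations

from typing import Callable, Iterable


class StepNameCache:
    """Remember each job's step names so the lookup runs once per job."""

    def __init__(self, fetch: Callable[[str], Iterable[str]]) -> None:
        self._fetch = fetch
        self._names: dict[str, tuple[str, ...]] = {}

    def get(self, job: str) -> tuple[str, ...]:
        """Return cached step names, fetching them only on the first call."""
        if job not in self._names:
            self._names[job] = tuple(self._fetch(job))
        return self._names[job]

    def forget(self, job: str) -> None:
        """Drop a finished job so the cache does not grow without bound."""
        self._names.pop(job, None)
```

Calling `forget` when a job reaches a terminal status keeps the cache bounded by the number of in-flight jobs.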

### Issue 4 of 4
press/press/doctype/server/server.py:1938-1945
**`expires_in.timestamp()` on a timezone-stripped naive datetime**

`datetime.datetime.now(datetime.timezone.utc)` yields a UTC-aware datetime. After `.replace(tzinfo=None)` it becomes a naive datetime. Calling `.timestamp()` on a naive datetime interprets it as local time, so on a server in a non-UTC timezone the `exp` claim in the JWT will be offset by the UTC delta — the token effectively expires earlier or later than intended. Remove the `.replace(tzinfo=None)` from the `expires_in` assignment so the aware datetime is converted correctly; strip the timezone only when writing to the Frappe `Datetime` field.

```suggestion
		expires_in_aware = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=90)

		# Strip tzinfo only for the Frappe Datetime field (which is stored as naive UTC)
		expires_in = expires_in_aware.replace(tzinfo=None)

		payload = {
			"server": self.name,
			"exp": int(expires_in_aware.timestamp()),  # expires 90 days from now
		}
```
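
The size of the skew can be reproduced without changing the host timezone. The sketch below is illustrative only: it pins a fixed UTC+05:30 offset onto the naive value to emulate what `.timestamp()` would assume on a host in that zone; the date and the offset are arbitrary choices.

```python
from datetime import datetime, timedelta, timezone

# An aware UTC instant, as produced by datetime.now(timezone.utc).
aware = datetime(2026, 1, 1, tzinfo=timezone.utc)

# Stripping tzinfo keeps the wall-clock digits but drops the "this is UTC" fact.
naive = aware.replace(tzinfo=None)

# naive.timestamp() treats those digits as local time. Emulate a host at
# UTC+05:30 by attaching that offset explicitly.
host_zone = timezone(timedelta(hours=5, minutes=30))
as_if_local = naive.replace(tzinfo=host_zone)

correct_exp = int(aware.timestamp())
skewed_exp = int(as_if_local.timestamp())

# Positive drift means the skewed token expires earlier than intended.
drift_seconds = correct_exp - skewed_exp
print(drift_seconds)  # 19800, i.e. 5 h 30 min
```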

Reviews (9): Last reviewed commit: "fix(agent): Throw permission error if ve..."

codecov · 2026-05-18T10:49:55Z

Codecov Report

❌ Patch coverage is 69.52381% with 128 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.85%. Comparing base (b0f767a) to head (5ce545a).
⚠️ Report is 1 commit behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| press/press/doctype/server/server.py | 30.43% | 32 Missing ⚠️ |
| press/press/doctype/agent_job/agent_job.py | 41.02% | 23 Missing ⚠️ |
| press/api/site.py | 42.42% | 19 Missing ⚠️ |
| press/api/agent_auth.py | 57.14% | 15 Missing ⚠️ |
| press/agent.py | 66.66% | 11 Missing ⚠️ |
| ...s/press/doctype/database_server/database_server.py | 44.44% | 5 Missing ⚠️ |
| ...ss/doctype/database_server/test_database_server.py | 92.15% | 4 Missing ⚠️ |
| ...press/doctype/analytics_server/analytics_server.py | 0.00% | 3 Missing ⚠️ |
| press/press/doctype/log_server/log_server.py | 0.00% | 3 Missing ⚠️ |
| press/press/doctype/nat_server/nat_server.py | 0.00% | 3 Missing ⚠️ |
... and 5 more

❌ Your patch check has failed because the patch coverage (69.52%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #6446      +/-   ##
===========================================
+ Coverage    49.76%   49.85%   +0.08%     
===========================================
  Files          955      958       +3     
  Lines        78917    79285     +368     
  Branches       361      360       -1     
===========================================
+ Hits         39272    39526     +254     
- Misses       39621    39735     +114     
  Partials        24       24

| Flag | Coverage Δ | |
| --- | --- | --- |
| dashboard | `59.81% <ø> (-0.05%)` | ⬇️ |


tanmoysrt · 2026-05-20T05:48:48Z

 	if not server:
 		frappe.throw("Not permitted", frappe.ValidationError)

+	verify_agent(server)


Parse the server name from the `sub` claim of the JWT and use that from here on. Better not to accept a `server` param from the request.
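
A sketch of that idea, assuming the token has already been verified and decoded into a `claims` dict. `server_from_claims` is a hypothetical helper, and the optional mismatch check is an assumption for a transition period in which agents still send the parameter.

```python
from __future__ import annotations


def server_from_claims(claims: dict, requested_server: str | None = None) -> str:
    """Derive the server name from the verified token instead of the request."""
    server = claims.get("sub")
    if not isinstance(server, str) or not server:
        raise PermissionError("token has no subject")
    # Transitional assumption: if the caller still sends a server, it must agree
    # with the token, so one server cannot act on behalf of another.
    if requested_server is not None and requested_server != server:
        raise PermissionError("server parameter does not match token subject")
    return server
```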

tanmoysrt · 2026-05-20T05:49:48Z

+
+@frappe.whitelist(allow_guest=True)
+@rate_limit(limit=500, seconds=60)
+def update_job(job, server):


Add type hints for the parameters.

tanmoysrt · 2026-05-20T05:51:06Z

+	frappe.enqueue(
+		handle_polled_job,
+		queue="short",
+		polled_job=job,
+		job=job_doc,


Run the job in the request instead of enqueuing it, so that the agent knows whether to retry.
In case of failure, roll back the changes and then increase callback_failure_count (check handle_polled_job to see whether that is already handled).

If the callback failure count has already been crossed, return a successful status to the agent and mark the job as failure on Press.
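
Roughly, the flow being asked for could look like the sketch below. Everything here is hypothetical: `JobRecord` is an in-memory stand-in for the Agent Job document, `MAX_CALLBACK_FAILURES` is an invented limit, and `apply_update`/`rollback` stand in for the real processing and for the database rollback. The integer return values mimic HTTP statuses the agent would see.

```python
from dataclasses import dataclass
from typing import Callable

MAX_CALLBACK_FAILURES = 3  # hypothetical limit, not taken from the PR


@dataclass
class JobRecord:
    """In-memory stand-in for an Agent Job document."""

    status: str = "Running"
    callback_failure_count: int = 0


def handle_update(
    job: JobRecord,
    apply_update: Callable[[JobRecord], None],
    rollback: Callable[[], None],
) -> int:
    """Process the update inside the request and tell the agent whether to retry."""
    if job.callback_failure_count >= MAX_CALLBACK_FAILURES:
        # Limit already crossed: stop the agent from retrying and fail the job.
        job.status = "Failure"
        return 200
    try:
        apply_update(job)
    except Exception:
        rollback()  # discard partial changes before recording the failure
        job.callback_failure_count += 1
        return 500  # non-success status asks the agent to retry
    return 200
```

Because the work happens before the response is sent, the status code the agent receives reflects the actual outcome, which an enqueued job cannot provide.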

tanmoysrt · 2026-05-20T05:51:23Z

+def targets(server: str, token: str | None = None):
+	verify_agent(server)


We can leave it for now; it has its own token-based auth.

tanmoysrt · 2026-05-20T05:51:46Z


 @frappe.whitelist(allow_guest=True)
-def benches_are_idle(server: str, access_token: str) -> None:
+def benches_are_idle(server: str) -> None:


Better not to accept the server parameter. Get it from the `sub` claim of the JWT.

tanmoysrt · 2026-05-20T05:55:00Z

    - role: mariadb_memory_allocator
    - role: nginx
    - role: agent
+    - role: setup_agent_auth


Remove this and add the step to the agent role.

tanmoysrt · 2026-05-20T05:55:21Z

    - role: user
    - role: nginx
    - role: agent
+    - role: setup_agent_auth


same as above

tanmoysrt · 2026-05-20T05:55:27Z

    - role: user
    - role: nginx
    - role: agent
+    - role: setup_agent_auth


same as above

tanmoysrt · 2026-05-20T05:58:05Z

+	from press.press.doctype.server.server import BaseServer
+
+
+class AgentAuth(Document):


Let's make it stateless instead of tracking it in a separate doctype.
Press can issue the token with HS256 and store the last_issue and expiry of the JWT.

The agent can ask to refresh the token beforehand. But keep something on Press to reissue and set the token manually (Ansible play).
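
The "refresh beforehand" decision only needs the stored expiry. A minimal sketch follows, assuming the expiry is kept as an aware UTC datetime; `needs_refresh` and the 7-day `REFRESH_WINDOW` are hypothetical (the figure is borrowed from the earlier daily-rotation design, not from this comment).

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

REFRESH_WINDOW = timedelta(days=7)  # hypothetical lead time before expiry


def needs_refresh(
    expiry: datetime,
    now: datetime | None = None,
    window: timedelta = REFRESH_WINDOW,
) -> bool:
    """Return True once the token is within `window` of its expiry (or past it)."""
    current = datetime.now(timezone.utc) if now is None else now
    return expiry - current <= window
```

Both datetimes must be timezone-aware; mixing aware and naive values raises `TypeError` on subtraction, which surfaces the mistake instead of hiding it.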

tanmoysrt · 2026-05-20T06:15:21Z

+			"default": 0,
+			"fieldname": "push_feature",
+			"fieldtype": "Check",
+			"label": "Push Feature"


Give it a better name that makes clear it's related to agent jobs.

20vikash added 22 commits May 1, 2026 14:43

feat(server): Scaffold setup_agent_auth

eaef4c2

feat(server): Generate ED25519 key pair, set private key to agent

b6454ef

feat(server): Playbook for agent auth

d6e8876

feat(agent): Verify response token

567135f

refactor(server): Use raw format for public key, pkcs for private

47d2b04

refactor(agent): Also verify response token in raw_request

bc16af3

feat(server): Setup Agent Auth in proxy and database

2db84d8

fix(server): Move ED25519 key generation to BaseServer

62628f5

feat(server): Update Proxy and Database DOCTYPE

d841835

refactor(server): Move setup_agent_auth whitelist to BaseServer

f82609a

feat(server): Setup Agent Auth in server setup time

56a5522

fix(agent): Don't verify agent responses

6c19b5c

feat(api): Verify agent in whitelists which agent requests

f293457

refactor(agent-job): Add agent type annotation to poll_random_jobs

f5f4882

refactor(agent): Send agent signed long lived token

87e76f9

feat(agent): Schedule to check for token regeneration daily

3963c33

fix(agent-auth): Handle token regeneration edge cases

df10f26

feat(callback): Update status given by agent

6d72f95

feat(agent-job): Publish realtime for each step

231704e

refactor(agent-job): Move agent step publish into publish_update

956d8a4

feat(agent-job): Retry schedule for undelivered jobs

ee77c1d

feat(press-settings): Add feature flag for agent job push

5019d32

20vikash requested review from Aradhya-Tripathi, adityahase, balamurali27, ssiyad and tanmoysrt as code owners May 17, 2026 10:49

greptile-apps Bot reviewed May 17, 2026


Comment thread press/press/doctype/agent_auth/agent_auth.py Outdated

Comment thread press/press/doctype/agent_auth/agent_auth.py Outdated

Comment thread press/agent.py Outdated

Comment thread press/agent.py Outdated

Comment thread press/api/callbacks.py Outdated

20vikash mentioned this pull request May 17, 2026

Enhance: Press - Agent Communication frappe/agent#512

Open

20vikash added 3 commits May 18, 2026 08:32

fix(agent-job): Remove undelivered jobs cache after its done

62c2792

refactor(callbacks): Enqueue handle polled jobs

c364466

fix(agent): Throw permission error if verification failed

acd656a

20vikash marked this pull request as draft May 18, 2026 08:47

20vikash marked this pull request as ready for review May 18, 2026 08:47

Merge remote-tracking branch 'upstream/develop' into press_agent

ad03118

20vikash changed the title ~~Enhance: Press - Agent Communication.~~ feat: Enhance press-agent communication May 18, 2026

refactor(server): Reduce server on_update complexity

2c18ae3

20vikash force-pushed the press_agent branch from b551511 to 11d3655 Compare May 18, 2026 14:37

feat(agent-auth): Add tests

7766468

20vikash force-pushed the press_agent branch from 11d3655 to 7766468 Compare May 18, 2026 14:45

20vikash added 8 commits May 18, 2026 15:34

fix(test): Fix mock tests

9cd1279

feat(test): Add more agent auth tests

f9fe94a

fix(test-server): Only mock cache.delete_key

afa304e

fix(test): Test fixes

dfdf8dc

fix(lint): Fix lint issues

450103d

fix(server): Fix set db healthcheck

bd1d17a

fix(test-audit): Fix flaky backup audit by using relative timestamps

0f7c045

revert(test-audit): Revert flaky test_audit changes

29deace

tanmoysrt requested changes May 20, 2026


20vikash and others added 9 commits May 24, 2026 14:23

refactor(agent): Use HS256 and hand off retry and regenerate to agent

1d07e4f

Merge branch 'develop' into press_agent

5a77ccc

fix(ruff): Fix ruff issues

eb2ab59

feat(agent): Add test cases

1997a5a

fix(test): Add update_feature patch

90dec3a

fix(server): Ignore update_feature on tests

1de5014

chore(server): Remove sync_database_server_public_status line

3d64c7d

fix(callbacks): Fix test cases

3cd7f3d

chore(agent): Remove printing payload

5ce545a
