Fix /v1/jobs/statuses pagination for jobs that share a ModifyIndex #28178

Open
afreidah wants to merge 1 commit into hashicorp:main from afreidah:fix-jobs-statuses-pagination-tokenizer

@afreidah afreidah commented Jun 25, 2026

What

/v1/jobs/statuses (which backs the UI jobs page) paginates with a next_token cursor built from each job's ModifyIndex alone. That only worked while ModifyIndex was unique per job. #28158 made the jobs modify_index index non-unique - several jobs can legitimately share a ModifyIndex when written in one Raft transaction - and once that's true, a ModifyIndex-only cursor no longer identifies a single position in the list.

This is the narrow fix for that: give the cursor a tiebreaker so it points at exactly one job - ModifyIndex + Namespace + ID, compared numerically on the index, then by namespace, then by id. That matches the order the state store actually walks the non-unique modify_index index (which breaks ties on the (Namespace, ID) primary key), so paging lines up with the data.
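
The ordering described above can be sketched as a three-way comparison. This is a minimal illustration, not the code in nomad/state/paginator: the `cursor` type and `compare` function are hypothetical names invented for the example. It shows the two properties the fix relies on: the index is compared as a number (so 9 sorts before 76 and 100, which a string comparison would get wrong), and jobs with equal indexes are still strictly ordered by namespace and then ID.

```go
package main

import (
	"fmt"
	"sort"
)

// cursor is a hypothetical stand-in for the position a pagination token
// identifies; it is not the type used in nomad/state/paginator.
type cursor struct {
	ModifyIndex uint64
	Namespace   string
	ID          string
}

// compare returns -1, 0, or 1. The index is compared numerically first,
// then Namespace, then ID, so two cursors compare equal only when all
// three fields match.
func compare(a, b cursor) int {
	switch {
	case a.ModifyIndex < b.ModifyIndex:
		return -1
	case a.ModifyIndex > b.ModifyIndex:
		return 1
	case a.Namespace < b.Namespace:
		return -1
	case a.Namespace > b.Namespace:
		return 1
	case a.ID < b.ID:
		return -1
	case a.ID > b.ID:
		return 1
	}
	return 0
}

func main() {
	jobs := []cursor{
		{76, "default", "j04"},
		{9, "default", "j01"},
		{76, "default", "j03"},
		{100, "default", "j02"},
	}
	// Sort ascending by (ModifyIndex, Namespace, ID).
	sort.Slice(jobs, func(i, j int) bool { return compare(jobs[i], jobs[j]) < 0 })
	for _, j := range jobs {
		fmt.Println(j.ModifyIndex, j.Namespace, j.ID)
	}
}
```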

Relates to #28167.

Symptoms fixed

With jobs sharing a ModifyIndex and pagination on (e.g. 30 jobs, per_page=25):

  • Duplicate rows - jobs at a page boundary come back again on the next page.
  • Stall / unreachable jobs - if a shared-ModifyIndex group is larger than per_page, the cursor never advances past it; the same page repeats and older jobs become unreachable.
  • Broken "Last" page - "jump to last" landed on the wrong boundary inside a tied group, dropping the very oldest jobs and pulling in newer ones.
  • Ghost page after "Last" - because "Last" wasn't actually last, Next stayed enabled and revealed a few more (already-/never-shown) jobs past it.

All four share one root cause and are fixed by the single tokenizer change.

Reproduction

Same Docker A/B harness as #28132/#28158 (1 server + 1 client, no Consul). Jobs are forced to share a ModifyIndex by giving every alloc the same absolute exit epoch, so their status writes coalesce into one Raft transaction:
https://github.com/afreidah/nomad/tree/repro-jobs-statuses-28132/repro-28132

Walking /v1/jobs/statuses with a small per_page over jobs that share a ModifyIndex, before the fix:

page  1:  6 job(s)   next_token=77
page  2:  6 job(s)   next_token=76
page  3:  6 job(s)   next_token=76   <- same token it was given; repeats forever
duplicated:      j03 j04 j05 j20
never returned:  j09 j10 j11 j12 j13 j14

After the fix the same walk returns every job exactly once and terminates.
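
To make the stall concrete, here is a toy model of the walk, under stated assumptions: it is not Nomad's paginator, the `job`, `page`, `walk`, and `sampleJobs` names are invented for the sketch, the namespace is omitted for brevity, and jobs are assumed to be listed newest first. With an index-only cursor, a group of six jobs sharing index 76 and `perPage` of 3 hands back a cursor that never moves; adding the ID as a tiebreaker lets every page start at exactly one job.

```go
package main

import "fmt"

type job struct {
	ModifyIndex uint64
	ID          string
}

// sampleJobs returns jobs sorted newest first, with ties ordered by ID
// descending. Six of them share ModifyIndex 76.
func sampleJobs() []job {
	return []job{
		{77, "j01"},
		{76, "j07"}, {76, "j06"}, {76, "j05"},
		{76, "j04"}, {76, "j03"}, {76, "j02"},
		{75, "j08"},
	}
}

// page skips every job that sorts ahead of the cursor, returns up to perPage
// jobs, and reports the first job that did not fit (nil when the list is done).
func page(jobs []job, ahead func(job) bool, perPage int) (out []job, next *job) {
	for i := range jobs {
		if ahead(jobs[i]) {
			continue
		}
		if len(out) == perPage {
			return out, &jobs[i]
		}
		out = append(out, jobs[i])
	}
	return out, nil
}

// walk pages through jobs and counts how often each ID is returned.
// With tieBreak=false the cursor is the ModifyIndex alone.
func walk(jobs []job, perPage int, tieBreak bool) (seen map[string]int, stalled bool) {
	seen = map[string]int{}
	var cur *job // nil means "start from the first job"
	for {
		ahead := func(j job) bool {
			if cur == nil {
				return false
			}
			if j.ModifyIndex != cur.ModifyIndex {
				return j.ModifyIndex > cur.ModifyIndex
			}
			return tieBreak && j.ID > cur.ID
		}
		out, next := page(jobs, ahead, perPage)
		for _, j := range out {
			seen[j.ID]++
		}
		if next == nil {
			return seen, false
		}
		// Index-only: an unchanged index means the next request would
		// return the same page again, so stop and report the stall.
		if !tieBreak && cur != nil && next.ModifyIndex == cur.ModifyIndex {
			return seen, true
		}
		cur = next
	}
}

func main() {
	for _, tieBreak := range []bool{false, true} {
		seen, stalled := walk(sampleJobs(), 3, tieBreak)
		fmt.Printf("tiebreak=%v distinct=%d stalled=%v\n", tieBreak, len(seen), stalled)
	}
}
```

In this model the index-only walk stalls after returning four distinct jobs (two of them twice), while the tiebreaker walk returns all eight jobs once each and terminates.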

Compatibility

Bare-integer tokens minted by older clients/servers are still accepted - a token that doesn't carry the namespace/id segments falls back to the previous index-only comparison - so rolling upgrades keep working.
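
A rough sketch of what accepting both token shapes could look like. The wire format here is an assumption made only for illustration: the PR text above does not spell out the separator or encoding, so the `.` delimiter, the `token` type, and `parseToken` are all hypothetical. The point is just that a token with no namespace/id segments is recognized as legacy and can be routed to the old index-only comparison.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// token is a hypothetical parsed cursor.
type token struct {
	ModifyIndex uint64
	Namespace   string
	ID          string
	Legacy      bool // true when only the index was present
}

// parseToken accepts either "<index>" (legacy) or "<index>.<namespace>.<id>".
// The "." separator is an assumption made for this sketch. SplitN with a
// limit of 3 keeps any further dots inside the ID segment.
func parseToken(s string) (token, error) {
	parts := strings.SplitN(s, ".", 3)
	idx, err := strconv.ParseUint(parts[0], 10, 64)
	if err != nil {
		return token{}, fmt.Errorf("invalid index in token %q: %w", s, err)
	}
	switch len(parts) {
	case 1:
		// Bare integer: fall back to index-only behavior.
		return token{ModifyIndex: idx, Legacy: true}, nil
	case 3:
		return token{ModifyIndex: idx, Namespace: parts[1], ID: parts[2]}, nil
	}
	return token{}, fmt.Errorf("malformed token %q", s)
}

func main() {
	for _, s := range []string{"76", "76.default.j03", "76.default"} {
		tok, err := parseToken(s)
		fmt.Printf("%q -> %+v err=%v\n", s, tok, err)
	}
}
```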

Changes

  • nomad/state/paginator/tokenizer.go - replace ModifyIndexTokenizer with ModifyIndexAndNamespaceIDTokenizer (the statuses endpoint was its only caller).
  • nomad/job_endpoint_statuses.go - use the new tokenizer.
  • ui/ - treat the pagination token as opaque (no more arithmetic on it); forward/back use a short token history, only "jump to last" derives a cursor. Mirage mock updated to the new token format.
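
The token-history idea from the ui/ bullet can be illustrated independently of the UI framework. The sketch below is written in Go only to match the rest of this PR; the actual ui/ change is JavaScript, and the `pager` type and its methods are hypothetical names, not the component's API. Going forward pushes the current token and adopts whatever `next_token` the server returned; going back pops the stack. The token is never inspected or modified.

```go
package main

import "fmt"

// pager tracks opaque page tokens. The empty string stands for the first page.
type pager struct {
	current string   // token used to fetch the page being shown
	history []string // tokens of the pages visited before the current one
}

// Next moves forward using the next_token returned by the server.
func (p *pager) Next(nextToken string) {
	p.history = append(p.history, p.current)
	p.current = nextToken
}

// Prev moves back one page; it reports false when already on the first page.
func (p *pager) Prev() bool {
	if len(p.history) == 0 {
		return false
	}
	last := len(p.history) - 1
	p.current = p.history[last]
	p.history = p.history[:last]
	return true
}

func main() {
	var p pager
	p.Next("token-b")
	p.Next("token-c")
	p.Prev()
	fmt.Printf("current token: %q\n", p.current)
}
```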

Testing

  • Go: TestJob_Statuses_Pagination_SharedModifyIndex pages a set of jobs that all share a ModifyIndex and asserts every job is returned exactly once and the walk terminates; tokenizer unit tests cover numeric index ordering, the namespace/id tiebreaker, and the legacy bare-integer fallback.
  • UI: a new acceptance test exercises the same through the jobs page; full UI suite passes.
  • Manual: reproduced and verified before/after against live 1-server/1-client clusters (UI + API), including the "Last" and Next-after-Last cases above.

Scope question

I kept this PR deliberately narrow - just the ModifyIndex cursor tiebreaker. While investigating I found a related, separate pagination bug in NamespaceIDTokenizer (two namespaces like team and team-a can stall/duplicate), and there's an open question in #28167 about whether the four tokenizers in this file should be consolidated onto a shared helper. I'm happy to either keep this narrow and track the namespace bug + refactor separately, or expand scope - whatever you'd prefer. Flagging here so the decision is visible rather than baked in.

AI usage

The investigation, the reproduction harness, the design, and the verification are mine: I reproduced all the symptoms on live clusters, ran the before/after A/B by hand (API walks and clicking through both UIs), and confirmed the results. The Go changes (the tokenizer tiebreaker and its tests) are my own work.

For the UI portion I did use an AI assistant for implementation help with the JavaScript. I almost never write JavaScript, so I always have to look up syntax, libraries, and best practices, although this wasn't that involved and I could mostly crib off the surrounding code and style. I used Claude a bit for the ui/ pagination changes (opaque token + token history), while making sure it followed the prescribed design and fit the style of the existing code. I reviewed and understand those changes, including every line of any suggestion I took from it, and I stand behind the whole PR, which has come out of the last couple of days of analysis and testing; I just want to be transparent about where the assistance was used. So please pay extra close attention to the JavaScript code: unlike Go, I'm not working with JavaScript every day and am more likely to have made subtle mistakes there.

ModifyIndex is not unique across jobs, so once the jobs modify_index index
became non-unique (hashicorp#28158) the /v1/jobs/statuses pagination cursor
(ModifyIndexTokenizer) could no longer identify a unique position: jobs sharing
a ModifyIndex were returned on more than one page, and a group larger than
per_page pinned the cursor so older jobs were unreachable.

Add ModifyIndexAndNamespaceIDTokenizer, which tokenizes on
ModifyIndex + Namespace + ID -- matching the memdb iteration order of the
non-unique index, which breaks ties on the (Namespace, ID) primary key -- with
a legacy bare-integer fallback for rolling upgrades, and use it for
Job.Statuses. Retire the now-unused ModifyIndexTokenizer.

On the web UI, treat the page token as opaque: navigate with a history stack
instead of doing arithmetic on the cursor, and only synthesize a cursor for the
"last" page.

Fixes hashicorp#28167
@afreidah afreidah force-pushed the fix-jobs-statuses-pagination-tokenizer branch from 7df789b to 839f5b3 Compare June 25, 2026 08:51
@afreidah

FYI @gulducat this is the follow-up to #28158 that we talked about.

afreidah commented Jun 25, 2026

Also, this is what the diff would look like for the full refactor I mentioned in this PR and in the issue. It actually isn't as bad as I thought it was going to be, unless I missed something.

main...afreidah:nomad:pagination-tokenizers-shared-helper

If you'd prefer to go that direction, I can close this PR and open one from that branch, or update this branch to match it and re-push. Otherwise the narrow fix is ready, and I can file the rest as a separate GitHub issue.
