Fix /v1/jobs/statuses pagination for jobs that share a ModifyIndex by afreidah · Pull Request #28178 · hashicorp/nomad

afreidah · 2026-06-25T08:15:04Z

What

/v1/jobs/statuses (which backs the UI jobs page) paginates with a next_token cursor built from each job's ModifyIndex alone. That only worked while ModifyIndex was unique per job. #28158 made the jobs modify_index index non-unique - several jobs can legitimately share a ModifyIndex when written in one Raft transaction - and once that's true, a ModifyIndex-only cursor no longer identifies a single position in the list.

This is the narrow fix for that: give the cursor a tiebreaker so it points at exactly one job - ModifyIndex + Namespace + ID, compared numerically on the index, then by namespace, then by id. That matches the order the state store actually walks the non-unique modify_index index (which breaks ties on the (Namespace, ID) primary key), so paging lines up with the data.

Relates to #28167.

Symptoms fixed

With jobs sharing a ModifyIndex and pagination on (e.g. 30 jobs, per_page=25):

Duplicate rows - jobs at a page boundary come back again on the next page.
Stall / unreachable jobs - if a shared-ModifyIndex group is larger than per_page, the cursor never advances past it; the same page repeats and older jobs become unreachable.
Broken "Last" page - "jump to last" landed on the wrong boundary inside a tied group, dropping the very oldest jobs and pulling in newer ones.
Ghost page after "Last" - because "Last" wasn't actually last, Next stayed enabled and revealed a few more jobs past it, some already shown and some never shown.

All four share one root cause and are fixed by the single tokenizer change.

Reproduction

Same Docker A/B harness as #28132/#28158 (1 server + 1 client, no Consul). Jobs are forced to share a ModifyIndex by giving every alloc the same absolute exit epoch, so their status writes coalesce into one Raft transaction:
https://github.com/afreidah/nomad/tree/repro-jobs-statuses-28132/repro-28132

Walking /v1/jobs/statuses with a small per_page over jobs that share a ModifyIndex, before the fix:

page  1:  6 job(s)   next_token=77
page  2:  6 job(s)   next_token=76
page  3:  6 job(s)   next_token=76   <- same token it was given; repeats forever
duplicated:      j03 j04 j05 j20
never returned:  j09 j10 j11 j12 j13 j14

After the fix the same walk returns every job exactly once and terminates.

Compatibility

Bare-integer tokens minted by older clients/servers are still accepted - a token that doesn't carry the namespace/id segments falls back to the previous index-only comparison - so rolling upgrades keep working.

Changes

nomad/state/paginator/tokenizer.go - replace ModifyIndexTokenizer with ModifyIndexAndNamespaceIDTokenizer (the statuses endpoint was its only caller).
nomad/job_endpoint_statuses.go - use the new tokenizer.
ui/ - treat the pagination token as opaque (no more arithmetic on it); forward/back use a short token history, only "jump to last" derives a cursor. Mirage mock updated to the new token format.

Testing

Go: TestJob_Statuses_Pagination_SharedModifyIndex pages a set of jobs that all share a ModifyIndex and asserts every job is returned exactly once and the walk terminates; tokenizer unit tests cover numeric index ordering, the namespace/id tiebreaker, and the legacy bare-integer fallback.
UI: a new acceptance test exercises the same through the jobs page; full UI suite passes.
Manual: reproduced and verified before/after against live 1-server/1-client clusters (UI + API), including the "Last" and Next-after-Last cases above.

Scope question

I kept this PR deliberately narrow: just the ModifyIndex cursor tiebreaker. While investigating I found a related but separate pagination bug in NamespaceIDTokenizer (with two namespaces like team and team-a, paging can stall or return duplicates), and there's an open question in #28167 about whether the four tokenizers in this file should be consolidated onto a shared helper. I'm happy to either keep this narrow and track the namespace bug and the refactor separately, or expand the scope, whichever you prefer. I'm flagging it here so the decision is visible rather than baked in.

AI usage

The investigation, the reproduction harness, the design, and the verification are mine: I reproduced all the symptoms on live clusters, ran the before/after A/B by hand (API walks and clicking through both UIs), and confirmed the results. The Go changes (the tokenizer tiebreaker and its tests) are my own work.

For the UI portion I used an AI assistant (Claude) for implementation help with the JavaScript. I rarely write JavaScript and generally avoid it, so I always have to look up syntax, libraries, and best practices. This change wasn't very involved and I could mostly crib from the surrounding code and style, but I used Claude for the ui/ pagination changes (opaque token + token history), while making sure it followed the prescribed design and fit the style of the existing code. I reviewed every line of any suggestion I took, I understand those changes, and I stand behind the whole PR, which came out of the last couple of days of analysis and testing; I just want to be transparent about where the assistance was used. Please pay extra close attention to the JavaScript: unlike Go, I don't work with it every day and am more likely to have made subtle mistakes there.

ModifyIndex is not unique across jobs, so once the jobs modify_index index became non-unique (hashicorp#28158) the /v1/jobs/statuses pagination cursor (ModifyIndexTokenizer) could no longer identify a unique position: jobs sharing a ModifyIndex were returned on more than one page, and a group larger than per_page pinned the cursor so older jobs were unreachable. Add ModifyIndexAndNamespaceIDTokenizer, which tokenizes on ModifyIndex + Namespace + ID -- matching the memdb iteration order of the non-unique index, which breaks ties on the (Namespace, ID) primary key -- with a legacy bare-integer fallback for rolling upgrades, and use it for Job.Statuses. Retire the now-unused ModifyIndexTokenizer. On the web UI, treat the page token as opaque: navigate with a history stack instead of doing arithmetic on the cursor, and only synthesize a cursor for the "last" page. Fixes hashicorp#28167

afreidah · 2026-06-25T09:35:10Z

FYI @gulducat this is the follow-up to #28158 that we talked about.

afreidah · 2026-06-25T11:02:03Z

Also, this is what the diff would look like for the full refactor I mentioned in this PR and in the issue. It actually isn't as bad as I expected, unless I missed something.

main...afreidah:nomad:pagination-tokenizers-shared-helper

If that direction is preferred, I can close this PR and push that one, or update this branch to match it and re-push. Otherwise, the narrow bare-minimum fix is ready and I can file the rest as a separate GitHub issue.

afreidah requested review from a team as code owners June 25, 2026 08:15

jrasell added this to Nomad - Community Issues Triage Jun 25, 2026

github-project-automation Bot moved this to Needs Triage in Nomad - Community Issues Triage Jun 25, 2026

afreidah force-pushed the fix-jobs-statuses-pagination-tokenizer branch from 7df789b to 839f5b3 June 25, 2026 08:51

afreidah mentioned this pull request Jun 25, 2026

Unify pagination tokenizers on a shared helper (+ fix Namespace+ID cursor ordering) afreidah/nomad#4 (Open)

afreidah wants to merge 1 commit into hashicorp:main from afreidah:fix-jobs-statuses-pagination-tokenizer
