Fix /v1/jobs/statuses pagination for jobs that share a ModifyIndex#28178
Open
afreidah wants to merge 1 commit into
Open
Fix /v1/jobs/statuses pagination for jobs that share a ModifyIndex#28178afreidah wants to merge 1 commit into
afreidah wants to merge 1 commit into
Conversation
ModifyIndex is not unique across jobs, so once the jobs modify_index index became non-unique (hashicorp#28158) the /v1/jobs/statuses pagination cursor (ModifyIndexTokenizer) could no longer identify a unique position: jobs sharing a ModifyIndex were returned on more than one page, and a group larger than per_page pinned the cursor so older jobs were unreachable. Add ModifyIndexAndNamespaceIDTokenizer, which tokenizes on ModifyIndex + Namespace + ID -- matching the memdb iteration order of the non-unique index, which breaks ties on the (Namespace, ID) primary key -- with a legacy bare-integer fallback for rolling upgrades, and use it for Job.Statuses. Retire the now-unused ModifyIndexTokenizer. On the web UI, treat the page token as opaque: navigate with a history stack instead of doing arithmetic on the cursor, and only synthesize a cursor for the "last" page. Fixes hashicorp#28167
7df789b to
839f5b3
Compare
Contributor
Author
Contributor
Author
|
Also, this is what the diff would look like on the full refactor I mentioned in this PR and in the issue.. It actually isn't as bad as I was thinking it was going to be unless I missed something. main...afreidah:nomad:pagination-tokenizers-shared-helper If it is preferred to go that direction I can close this PR and push that one, or update this branch to match that one and re-push. Otherwise the narrow bare-minimum fix is ready and I can just file the rest as another gh issue if that is preferable. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
/v1/jobs/statuses(which backs the UI jobs page) paginates with anext_tokencursor built from each job'sModifyIndexalone. That only worked whileModifyIndexwas unique per job. #28158 made the jobsmodify_indexindex non-unique - several jobs can legitimately share aModifyIndexwhen written in one Raft transaction - and once that's true, aModifyIndex-only cursor no longer identifies a single position in the list.This is the narrow fix for that: give the cursor a tiebreaker so it points at exactly one job -
ModifyIndex+Namespace+ID, compared numerically on the index, then by namespace, then by id. That matches the order the state store actually walks the non-uniquemodify_indexindex (which breaks ties on the(Namespace, ID)primary key), so paging lines up with the data.Relates to #28167.
Symptoms fixed
With jobs sharing a
ModifyIndexand pagination on (e.g. 30 jobs,per_page=25):ModifyIndexgroup is larger thanper_page, the cursor never advances past it; the same page repeats and older jobs become unreachable.All four share one root cause and are fixed by the single tokenizer change.
Reproduction
Same Docker A/B harness as #28132/#28158 (1 server + 1 client, no Consul). Jobs are forced to share a
ModifyIndexby giving every alloc the same absolute exit epoch, so their status writes coalesce into one Raft transaction:https://github.com/afreidah/nomad/tree/repro-jobs-statuses-28132/repro-28132
Walking
/v1/jobs/statuseswith a smallper_pageover jobs that share aModifyIndex, before the fix:After the fix the same walk returns every job exactly once and terminates.
Compatibility
Bare-integer tokens minted by older clients/servers are still accepted - a token that doesn't carry the namespace/id segments falls back to the previous index-only comparison - so rolling upgrades keep working.
Changes
nomad/state/paginator/tokenizer.go- replaceModifyIndexTokenizerwithModifyIndexAndNamespaceIDTokenizer(the statuses endpoint was its only caller).nomad/job_endpoint_statuses.go- use the new tokenizer.ui/- treat the pagination token as opaque (no more arithmetic on it); forward/back use a short token history, only "jump to last" derives a cursor. Mirage mock updated to the new token format.Testing
TestJob_Statuses_Pagination_SharedModifyIndexpages a set of jobs that all share aModifyIndexand asserts every job is returned exactly once and the walk terminates; tokenizer unit tests cover numeric index ordering, the namespace/id tiebreaker, and the legacy bare-integer fallback.Scope question
I kept this PR deliberately narrow - just the
ModifyIndexcursor tiebreaker. While investigating I found a related, separate pagination bug inNamespaceIDTokenizer(two namespaces liketeamandteam-acan stall/duplicate), and there's an open question in #28167 about whether the four tokenizers in this file should be consolidated onto a shared helper. I'm happy to either keep this narrow and track the namespace bug + refactor separately, or expand scope - whatever you'd prefer. Flagging here so the decision is visible rather than baked in.AI usage
The investigation, the reproduction harness, the design, and the verification are mine: I reproduced all the symptoms on live clusters, ran the before/after A/B by hand (API walks and clicking through both UIs), and confirmed the results. The Go changes (the tokenizer tiebreaker and its tests) are my own work.
For the UI portion I did utilize an AI assistant for implementation help with the JavaScript - I almost never write JavaScript and generally try to avoid it so I always have to look up syntax and libraries, and best practices although this wasn't that involved and I could mostly crib off of the surrounding code/style, so I utilized a bit of claude for the
ui/pagination changes (opaque token + token history), while making sure it followed the prescribed design and fit in with the style of the existing code. I reviewed and understand those changes and reviewed every line of any suggestion I took from it and I stand behind the whole PR that has come out of analysis and testing the last couple days; I just want to be transparent about where the assistance was used. So please pay extra close attention to the JavaScript code because unlike Go I'm not working with JavaScript every day and am more likely to have made subtle mistakes there.