fix(runs): seed WatchActions child phase counts so large map-task counts are correct#7589
Draft
pingsutw wants to merge 1 commit into
Draft
fix(runs): seed WatchActions child phase counts so large map-task counts are correct#7589pingsutw wants to merge 1 commit into
pingsutw wants to merge 1 commit into
Conversation
…nts are correct WatchActions builds childrenPhaseCounts by upserting every child action into the run-state tree as it streams the snapshot. For a large map task (e.g. 20k children) that snapshot takes ~25s to stream, and the console closes the stream after a few seconds -- so the count freezes far below the real total (observed 4.4K instead of 20000). Seed the counts up front from a single lightweight query (ListActionPhasesForCounts: name, parent, phase only -- no large bytea columns), so childrenPhaseCounts is correct from the first streamed page. Re-streaming the same rows in the existing snapshot loop is count-neutral (same phase => no-op in modifyPhaseCounters), so nothing regresses. This mirrors how Union Cloud already serves these counts via SQL aggregation instead of streaming every child. Signed-off-by: Kevin Su <pingsutw@apache.org>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes incorrect children_phase_counts for large map tasks in WatchActions by pre-seeding phase counts from a lightweight DB query before streaming the full action snapshot, so aggregate counts are correct immediately even if the client stream is cut short.
Changes:
- Added
ActionRepo.ListActionPhasesForCounts(runID)and used it inWatchActions.listAndSendAllActionsto pre-populate the run-state tree’s phase aggregates. - Added/updated unit and integration tests to validate count-neutral re-streaming behavior and the new repository query.
- Extended repository mocks and interface to support the new repo method.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| runs/service/run_state_manager_test.go | Adds a test ensuring “seed then re-stream” is count-neutral and live phase changes still adjust counts. |
| runs/service/run_service.go | Seeds child phase counts via a lightweight repo query before snapshot streaming in WatchActions. |
| runs/service/run_service_test.go | Updates tests to account for the new seed query call. |
| runs/repository/mocks/mocks.go | Adds mock support for ListActionPhasesForCounts. |
| runs/repository/interfaces/action.go | Adds the new repo method to the ActionRepo interface. |
| runs/repository/impl/action.go | Implements ListActionPhasesForCounts with a lightweight SELECT. |
| runs/repository/impl/action_test.go | Adds an embedded-Postgres test validating the new query behavior. |
Files not reviewed (1)
- runs/repository/mocks/mocks.go: Generated file
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+374
to
+386
| expr, err := NewRunActionsFilter(runID).QueryExpression("") | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to build filter expression: %w", err) | ||
| } | ||
|
|
||
| query := sqlx.Rebind(sqlx.DOLLAR, | ||
| "SELECT name, parent_action_name, phase, created_at FROM actions WHERE "+ | ||
| expr.Query+" ORDER BY created_at ASC, name ASC") | ||
|
|
||
| var actions []*models.Action | ||
| if err := sqlx.SelectContext(ctx, r.db, &actions, query, expr.Args...); err != nil { | ||
| return nil, fmt.Errorf("failed to list action phases: %w", err) | ||
| } |
Comment on lines
+29
to
+31
| // ListActionPhasesForCounts returns lightweight rows (name, parent, phase) for | ||
| // every action in a run, used to seed child phase counts without streaming all | ||
| // children. See the impl for details. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are the changes needed?
On the run detail page, a large map task's child count caps far below the real total — e.g. it shows 4.4K for a run with 20000 children.
WatchActionsbuildschildren_phase_countsby upserting every child action into the run-state tree as it streams the initial snapshot. For a 20k-child map task that snapshot is ~20 MB and takes ~25 s to stream. The console's connect-web client closes theWatchActionsstream after a few seconds (a client-side stream deadline — confirmed: the backend and ALB both deliver the full 20000 when read without that deadline, via port-forward and viabuf curlthrough the ALB). So the count freezes mid-climb at whatever arrived first (~4.4K).You fundamentally cannot deliver a 20k-child count by streaming 20k rows under a multi-second client deadline.
Related to #7585 (which fixes the separate
Offsetpaging bug in the same snapshot loop).What changes were proposed in this pull request?
Seed
children_phase_countsup front so it is correct from the first streamed page, instead of climbing as every child streams in:ActionRepo.ListActionPhasesForCounts(runID)— one lightweight query selecting onlyname, parent_action_name, phase, created_at(no largeaction_spec/action_details/... bytea columns), so the whole run loads fast in a single query, orderedcreated_at ASC, name ASC(parents before children).listAndSendAllActionspre-populates the run-state tree from that query before the streaming loop (state only — the per-node updates are discarded; the streaming loop re-sends them with full action data).upsertAction's update path callsmodifyPhaseCounters(parent, newPhase, oldPhase)withnewPhase == oldPhase, a net no-op — so the seeded total neither doubles nor regresses. Genuine live phase changes after the snapshot still adjust counts as before.This mirrors how Union Cloud already serves these counts — via SQL aggregation (
COUNT(*) FILTER (WHERE phase = …) GROUP BY …) rather than streaming every child.The per-action leaf stream (for the tree/list) is unchanged; this PR only fixes the count. Making the full child list robust for huge runs (task grouping + lazy expansion, as cloud does) is a larger follow-up.
How was this patch tested?
TestRunStateManagerSeedThenRestreamIsCountNeutral(no DB): seeds root→mapTask→150 children, asserts counts are correct after seeding (including the transitive count on the root), stay put when the same rows are re-streamed in pages, and still adjust on a real live phase change.TestListActionPhasesForCounts(embedded Postgres): root + 120 children with mixed phases; asserts the query returns every action withname/parent/phasepopulated, correct per-phase totals, andcreated_at ASCordering.runs/serviceandruns/repository/implsuites pass.Not deployed yet (load tests in progress on the shared dev cluster).
Labels
Check all the applicable boxes