Skip to content

fix(runs): seed WatchActions child phase counts so large map-task counts are correct#7589

Draft
pingsutw wants to merge 1 commit into
mainfrom
fix/watch-actions-aggregate-counts
Draft

fix(runs): seed WatchActions child phase counts so large map-task counts are correct#7589
pingsutw wants to merge 1 commit into
mainfrom
fix/watch-actions-aggregate-counts

Conversation

@pingsutw

Copy link
Copy Markdown
Member

Why are the changes needed?

On the run detail page, a large map task's child count caps far below the real total — e.g. it shows 4.4K for a run with 20000 children.

WatchActions builds children_phase_counts by upserting every child action into the run-state tree as it streams the initial snapshot. For a 20k-child map task that snapshot is ~20 MB and takes ~25 s to stream. The console's connect-web client closes the WatchActions stream after a few seconds (a client-side stream deadline — confirmed: the backend and ALB both deliver the full 20000 when read without that deadline, via port-forward and via buf curl through the ALB). So the count freezes mid-climb at whatever arrived first (~4.4K).

You fundamentally cannot deliver a 20k-child count by streaming 20k rows under a multi-second client deadline.

Related to #7585 (which fixes the separate Offset paging bug in the same snapshot loop).

What changes were proposed in this pull request?

Seed children_phase_counts up front so it is correct from the first streamed page, instead of climbing as every child streams in:

  • New repo method ActionRepo.ListActionPhasesForCounts(runID) — one lightweight query selecting only name, parent_action_name, phase, created_at (no large action_spec/action_details/... bytea columns), so the whole run loads fast in a single query, ordered created_at ASC, name ASC (parents before children).
  • listAndSendAllActions pre-populates the run-state tree from that query before the streaming loop (state only — the per-node updates are discarded; the streaming loop re-sends them with full action data).
  • Re-streaming the same rows in the existing loop is count-neutral: upsertAction's update path calls modifyPhaseCounters(parent, newPhase, oldPhase) with newPhase == oldPhase, a net no-op — so the seeded total neither doubles nor regresses. Genuine live phase changes after the snapshot still adjust counts as before.

This mirrors how Union Cloud already serves these counts — via SQL aggregation (COUNT(*) FILTER (WHERE phase = …) GROUP BY …) rather than streaming every child.

The per-action leaf stream (for the tree/list) is unchanged; this PR only fixes the count. Making the full child list robust for huge runs (task grouping + lazy expansion, as cloud does) is a larger follow-up.

ponytail: ListActionPhasesForCounts loads all rows for the run into memory. Fine to ~100k actions; beyond that switch to a recursive SQL CTE or paginate. Noted in a comment at the call site.

How was this patch tested?

  • TestRunStateManagerSeedThenRestreamIsCountNeutral (no DB): seeds root→mapTask→150 children, asserts counts are correct after seeding (including the transitive count on the root), stay put when the same rows are re-streamed in pages, and still adjust on a real live phase change.
  • TestListActionPhasesForCounts (embedded Postgres): root + 120 children with mixed phases; asserts the query returns every action with name/parent/phase populated, correct per-phase totals, and created_at ASC ordering.
  • Full runs/service and runs/repository/impl suites pass.

Not deployed yet (load tests in progress on the shared dev cluster).

Labels

  • fixed

Check all the applicable boxes

  • All new and existing tests passed.
  • All commits are signed-off.

…nts are correct

WatchActions builds childrenPhaseCounts by upserting every child action into
the run-state tree as it streams the snapshot. For a large map task (e.g. 20k
children) that snapshot takes ~25s to stream, and the console closes the
stream after a few seconds -- so the count freezes far below the real total
(observed 4.4K instead of 20000).

Seed the counts up front from a single lightweight query
(ListActionPhasesForCounts: name, parent, phase only -- no large bytea
columns), so childrenPhaseCounts is correct from the first streamed page.
Re-streaming the same rows in the existing snapshot loop is count-neutral
(same phase => no-op in modifyPhaseCounters), so nothing regresses. This
mirrors how Union Cloud already serves these counts via SQL aggregation
instead of streaming every child.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Copilot AI review requested due to automatic review settings June 24, 2026 08:28

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes incorrect children_phase_counts for large map tasks in WatchActions by pre-seeding phase counts from a lightweight DB query before streaming the full action snapshot, so aggregate counts are correct immediately even if the client stream is cut short.

Changes:

  • Added ActionRepo.ListActionPhasesForCounts(runID) and used it in WatchActions.listAndSendAllActions to pre-populate the run-state tree’s phase aggregates.
  • Added/updated unit and integration tests to validate count-neutral re-streaming behavior and the new repository query.
  • Extended repository mocks and interface to support the new repo method.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
runs/service/run_state_manager_test.go Adds a test ensuring “seed then re-stream” is count-neutral and live phase changes still adjust counts.
runs/service/run_service.go Seeds child phase counts via a lightweight repo query before snapshot streaming in WatchActions.
runs/service/run_service_test.go Updates tests to account for the new seed query call.
runs/repository/mocks/mocks.go Adds mock support for ListActionPhasesForCounts.
runs/repository/interfaces/action.go Adds the new repo method to the ActionRepo interface.
runs/repository/impl/action.go Implements ListActionPhasesForCounts with a lightweight SELECT.
runs/repository/impl/action_test.go Adds an embedded-Postgres test validating the new query behavior.
Files not reviewed (1)
  • runs/repository/mocks/mocks.go: Generated file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +374 to +386
expr, err := NewRunActionsFilter(runID).QueryExpression("")
if err != nil {
return nil, fmt.Errorf("failed to build filter expression: %w", err)
}

query := sqlx.Rebind(sqlx.DOLLAR,
"SELECT name, parent_action_name, phase, created_at FROM actions WHERE "+
expr.Query+" ORDER BY created_at ASC, name ASC")

var actions []*models.Action
if err := sqlx.SelectContext(ctx, r.db, &actions, query, expr.Args...); err != nil {
return nil, fmt.Errorf("failed to list action phases: %w", err)
}
Comment on lines +29 to +31
// ListActionPhasesForCounts returns lightweight rows (name, parent, phase) for
// every action in a run, used to seed child phase counts without streaming all
// children. See the impl for details.
@pingsutw pingsutw marked this pull request as draft June 24, 2026 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants