RFC: Workflow Performance Profiler - full design & implementation walkthrough #5216

PG1204 · 2026-05-26T05:11:25Z

Following up on @Yicong-Huang's suggestion on PR #5098. This post documents the entire profiler implementation from the hackathon submission, at the level of detail needed to decide what to land and in what order.

What it does

A feedback layer over Texera's existing per-operator runtime stats. While a workflow runs:

Canvas heatmap colors operators cold -> hot under one of three views (Runtime / Throughput / I/O imbalance).
Rule-based hints in the property panel ("filter passes 2% of rows - push it upstream").
Compare to past run - pick any completed execution; canvas recolors green/red by delta.
Downloadable report (Markdown + JSON).
(Optional) Ghost suggestions on canvas for two structural rewrites.

Design principle: read-only consumer of data we already produce. No new event types, no engine changes, no new HTTP endpoints in the critical path. Removing the profiler touches zero backend code paths.

Architecture

OperatorStatisticsUpdateEvent (existing)
│
▼
WorkflowStatusService (existing)
│
▼
ProfilerService ── reactive state (enabled, view, threshold, scores, baseline)
│
├── profiler-config (round-trip via WorkflowContent.profilerConfig)
├── profiler-hints (pure 6-rule engine)
├── profiler-delta (baseline diffing)
├── profiler-history (server-fetched past runs → BaselineReport)
├── profiler-report (MD + JSON export)
└── profiler-suggestions (optional ghost overlays)
│
▼
joint-ui (canvas color) · menu (controls) · property-panel (metrics + hints)

Components

Module	Purpose
ProfilerService	Throttled (500ms) subscriber to status stream; pure computeScores() per view; normalizes to [0,1]; resets on run restart
profiler-config	WorkflowContent.profilerConfig round-trip; defensive parser; equality guard against write-loop
profiler-hints	6 rules: IDLE_HEAVY, JOIN_HIGH_FANIN_LOW_FANOUT, LOW_PARALLELISM_HOT_OP, RUNTIME_OUTLIER, SCAN_FULL_TABLE_NO_FILTER, UPSTREAM_OVERPRODUCTION
profiler-delta	Per-op delta vs baseline; ±5% / <1ms counts as unchanged; surfaces matched / new / removed
profiler-history	Reuses existing /executions/{wid}/stats/{eid}; converts persisted rows → BaselineReport; memoized per (wid, eid)
profiler-report	MD + JSON; JSON is identical to the upload schema (no separate format)
profiler-suggestions (optional)	Structural rewrites (INSERT_FILTER, BUMP_WORKERS); per-workflow dismissed set in localStorage
ProfilerScoring.scala (optional)	Pure object mirroring TS scoring; zero call sites today; only useful if backend ever needs scoring
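
To make the "normalizes to [0,1]" step in the ProfilerService row concrete, here is a minimal sketch of one way to do it. It assumes plain min-max scaling, which the table does not confirm, and `RawMetrics` and `normalizeScores` are hypothetical names rather than the ones on the branch.

```typescript
// Hypothetical input: one raw metric value per operator for the selected view
// (for example, runtime in ms). Names and shapes here are illustrative only.
type RawMetrics = Record<string, number>;

/**
 * Min-max normalizes raw per-operator values into [0, 1].
 * Non-finite values are skipped. If every value is equal (or only one
 * operator is present), all scores are 0 so nothing is painted as "hot".
 */
function normalizeScores(raw: RawMetrics): Record<string, number> {
  const kept: Array<[string, number]> = [];
  for (const id of Object.keys(raw)) {
    const v = raw[id];
    if (v !== undefined && isFinite(v)) {
      kept.push([id, v]);
    }
  }

  const scores: Record<string, number> = {};
  if (kept.length === 0) return scores;

  const values = kept.map(pair => pair[1]);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min;

  for (const [id, v] of kept) {
    scores[id] = range === 0 ? 0 : (v - min) / range;
  }
  return scores;
}

// join is the hottest operator (1), filter the coldest (0), scan sits at 0.25.
const exampleScores = normalizeScores({ scan: 120, filter: 20, join: 420 });
```

Keeping this as a pure function with no service state is what would let the same logic be unit-tested per view, in line with the "pure computeScores()" description.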

Test totals on the branch: ~250 Vitest specs (all green), tsc --noEmit clean, ScalaTest covers the scoring helper.
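
The profiler-delta rule ("±5% / <1ms counts as unchanged; surfaces matched / new / removed") can be sketched as below. This is an illustration under assumptions: it treats an operator as unchanged when either tolerance is met, and all type and function names are hypothetical.

```typescript
type DeltaKind = "faster" | "slower" | "unchanged" | "new" | "removed";

interface OperatorDelta {
  operatorId: string;
  kind: DeltaKind;
  deltaMs: number | null; // null when the operator exists on only one side
}

const RELATIVE_TOLERANCE = 0.05; // ±5% of the baseline runtime
const ABSOLUTE_TOLERANCE_MS = 1; // differences under 1 ms are treated as noise

// Inputs are hypothetical: operator id -> runtime in ms for each run.
function classifyDeltas(
  baseline: Record<string, number>,
  current: Record<string, number>
): OperatorDelta[] {
  const deltas: OperatorDelta[] = [];

  for (const operatorId of Object.keys(current)) {
    const now = current[operatorId];
    const before = baseline[operatorId];
    if (now === undefined) continue;
    if (before === undefined) {
      deltas.push({ operatorId, kind: "new", deltaMs: null });
      continue;
    }
    const deltaMs = now - before;
    const withinAbsolute = Math.abs(deltaMs) < ABSOLUTE_TOLERANCE_MS;
    const withinRelative =
      before > 0 && Math.abs(deltaMs) / before <= RELATIVE_TOLERANCE;
    const kind: DeltaKind =
      withinAbsolute || withinRelative
        ? "unchanged"
        : deltaMs < 0
          ? "faster"
          : "slower";
    deltas.push({ operatorId, kind, deltaMs });
  }

  // Operators present in the baseline but missing from the current run.
  for (const operatorId of Object.keys(baseline)) {
    if (current[operatorId] === undefined) {
      deltas.push({ operatorId, kind: "removed", deltaMs: null });
    }
  }
  return deltas;
}

const exampleDeltas = classifyDeltas(
  { scan: 100, filter: 10, join: 400, old: 5 },
  { scan: 103, filter: 10.5, join: 300, sort: 50 }
);
// scan: unchanged (3%), filter: unchanged (<1 ms), join: faster,
// sort: new, old: removed
```

On the canvas, "faster" and "slower" would map to the green/red recoloring described under "Compare to past run".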

Things I'd like input on

Per-workflow config storage: one optional field on WorkflowContent. OK, or prefer a side table / user setting?
Backend scoring helper: include ProfilerScoring.scala now for parity, or drop until there's a call site?
Ghost suggestions vs agent: they cover overlapping use cases (the agent integration, deferred to a separate RFC, has similar Apply/Reject cards). Land both as complementary, or pick one?
Schema versioning: baseline JSON = exported report JSON. Add an explicit schemaVersion before merging, or trust the defensive parser?
Hint i18n: messages are English-only. Route through i18n now or defer?
Recompute scale: tested up to ~80 operators with no issue, no formal benchmark. Upper bound we care about?
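
For the schema versioning question, a rough sketch of how an explicit schemaVersion could sit alongside the defensive parser. The report shape, the field names, and the "missing version means 1" rule are all assumptions for illustration, not the schema on the branch.

```typescript
// Hypothetical report shape; the real exported JSON may differ.
interface BaselineReport {
  schemaVersion: number;
  operators: Record<string, { runtimeMs: number }>;
}

const CURRENT_SCHEMA_VERSION = 1;

/**
 * Parses an uploaded baseline. Returns null instead of throwing when the
 * payload is malformed or comes from a newer, unknown schema version.
 * Individual operator entries that fail validation are dropped.
 */
function parseBaselineReport(json: string): BaselineReport | null {
  let data: unknown;
  try {
    data = JSON.parse(json);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as Record<string, unknown>;

  // Assumption: files exported before versioning existed count as version 1.
  const version = obj["schemaVersion"] === undefined ? 1 : obj["schemaVersion"];
  if (typeof version !== "number" || version > CURRENT_SCHEMA_VERSION) return null;

  const rawOps = obj["operators"];
  if (typeof rawOps !== "object" || rawOps === null || Array.isArray(rawOps)) {
    return null;
  }
  const opsRecord = rawOps as Record<string, unknown>;

  const operators: Record<string, { runtimeMs: number }> = {};
  for (const id of Object.keys(opsRecord)) {
    const entry = opsRecord[id];
    if (typeof entry !== "object" || entry === null) continue;
    const runtimeMs = (entry as Record<string, unknown>)["runtimeMs"];
    if (typeof runtimeMs === "number" && isFinite(runtimeMs) && runtimeMs >= 0) {
      operators[id] = { runtimeMs };
    }
  }
  return { schemaVersion: version, operators };
}
```

The trade-off this illustrates: the defensive parser alone can reject garbage, but only a version field lets it tell "malformed" apart from "valid, but written by a newer build".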

Proposed merge order

7 independent PRs, each useful on its own:

Heatmap foundation: ProfilerService + config + canvas coloring + Execution-Settings controls + legend. (~1.2k LOC w/ tests)
Hints + property-panel metrics + threshold slider.
UX polish: displayName threading, hover tooltip, MD/JSON report download.
Per-workflow config round-trip.
Compare-across-runs + delta heatmap.
(optional) Ghost suggestions.
(optional, backend) ProfilerScoring.scala.
Agent integration -> separate RFC once 1–5 are in.

Happy to open PR 1 once there's directional agreement here. Specific yes/no on "Things I'd like input on" would unblock the most.

Yicong-Huang · 2026-05-26T05:34:38Z

Collaborator

I personally like this feature very much. I see it as threefold:

It displays the power of our engine: very interactive, and able to quickly gather metadata during execution. Super powerful.
It uses our workflow UI to display/embed information.
Agent suggestions can apply/fix/help the user based on the profile results.
So it is a very good unification of interactive engine + workflow UI + agent.

I think your draft architecture makes sense. Very good starting point! To move forward, we need smaller steps. We may not need all the features, though, so we will cherry-pick carefully.

A few comments:

I would not call it "profile", at least not toward users. I think we want it to be approachable for low-tech users, who may not understand "profile" or "profiling". I still prefer to keep it as "workflow status"; to the end user, it is "what's going on with my workflow?". Currently I think we have two concepts inside status, "state" and "statistics" (please correct me if I'm wrong, or we might need to do some refactoring). Can you add a "profile" concept inside WorkflowStatusService? The WorkflowStatusService should become the layer that contains the ground-truth information.
It might be good to have different overlay layers; for instance, you can introduce a "heat map" layer. See the other current workflow overlay layers: the user can toggle any of them on/off. The layer will just display the corresponding information from WorkflowStatusService.

For your suggested plan, I would not think of it in terms of PRs, but in terms of issues. Each issue is a to-do task, and each issue could correspond to multiple PRs (feature, revert, fix, reapply, doc, etc.).

So at a high level, I suggest we open an umbrella issue (don't name it profile/profiler though) and start with these three sub-issues:

Refactor the existing service (e.g., WorkflowStatusService) to capture/store on the frontend the information you need to display. This is purely at the information level, with no UI change yet.
Start adding one overlay layer. It could be a "heat map": which operator is the bottleneck, etc. This is the visual UI part.
Make sure any information displayed on the frontend can be reconstructed from the backend, meaning that if the user refreshes the page, they will see the same information.

We can come back to this discussion after you finish the first three.
WDYT?

2 replies

PG1204 May 26, 2026
Author

Sounds good, we can discuss and cherry pick the required features.

Reply to comments:

Yes, currently "WorkflowStatusService" has two concepts - "state" & "statistics". In my current impl, the "ProfilerService" is a separate service which subscribes to "WorkflowStatusService.getStatusUpdateStream()" as its input. Sure, I can extend the profile concept into "WorkflowStatusService", keeping it as the layer that contains the ground truth information.
Worth noting that today operatorState and the aggregated counts share one flat OperatorStatistics record on a single stream, so part of sub-issue 1 will be splitting state and statistics into cleanly separated sub-concepts before adding the third. Also agreed on keeping user-facing language as "workflow status", I'll scrub "profile/profiler" from any UI surface.
Understood. I'll add the Heat Map as a single new sibling toggle in the Layers menu next to Grid / Regions / Workers / Status - reading from WorkflowStatusService and visualizing the new sub-concept. Future overlay layers can come in as separate work later.
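
As a rough illustration of the split described in these replies, the sketch below separates a flat record into three sub-concepts. Every field name is hypothetical (the real OperatorStatistics record may differ), and `performance` is only a placeholder name for the third sub-concept.

```typescript
// Hypothetical flat record: state and aggregated counts on one object.
interface FlatOperatorStatus {
  operatorState: string;
  inputRowCount: number;
  outputRowCount: number;
  processingTimeMs: number;
}

// Hypothetical target shape: three cleanly separated sub-concepts.
interface OperatorStatusView {
  state: string;
  statistics: {
    inputRowCount: number;
    outputRowCount: number;
    processingTimeMs: number;
  };
  // Derived, read-only values that an overlay layer could visualize.
  performance: {
    selectivity: number | null; // output rows / input rows; null without input
  };
}

function splitStatus(flat: FlatOperatorStatus): OperatorStatusView {
  const { operatorState, inputRowCount, outputRowCount, processingTimeMs } = flat;
  return {
    state: operatorState,
    statistics: { inputRowCount, outputRowCount, processingTimeMs },
    performance: {
      selectivity: inputRowCount > 0 ? outputRowCount / inputRowCount : null,
    },
  };
}

// A filter that passes 2% of its rows yields selectivity 0.02.
const exampleView = splitStatus({
  operatorState: "Running",
  inputRowCount: 1000,
  outputRowCount: 20,
  processingTimeMs: 350,
});
```

Because the third sub-concept is derived purely from the first two, anything an overlay shows could in principle be rebuilt from persisted statistics after a page refresh, which is what sub-issue 3 asks for.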

Reply to PR plan:
This plan sounds good. I'll go ahead and work on the three suggested changes.

Umbrella issue title: since "profile/profiler" is off the table, something like "Workflow status overlays and ground-truth refactor", or do you have any preferred phrasing?
Sub-issue 3 scope: should "same information after refresh" cover both live (mid-execution reconnect) and post-execution (loading a completed run), or only one of those for now?

Is it alright if I add you as the reviewer on the PRs for all three sub-issues?

Yicong-Huang May 27, 2026
Collaborator

sg. thanks.

The umbrella issue should more or less be a broad project name (instead of describing its steps); let's call it "workflow runtime performance visualization" for now. You can have sub-issues for the exact steps. (Hint: use the commands introduced in "Add /sub-issue and /parent-issue comment commands for linking sub-issues from either end" #5147.)
It is my personal preference to stay away from profile/profiler, as it is not that accurate and not user-friendly. Feel free to reject it if you think "workflow runtime performance visualization" is not good, though.
For my proposed sub-issue 3, I think it should cover both cases.

Regarding review: everyone is welcome to review. For a healthy community, everyone can jump in and share reviews/comments. Feel free to tag me (using the commands introduced in #4986) on any PR I have not seen or commented on.
