Skip to content

feat: add client-side profiling trigger#326

Open
maxyanghu wants to merge 1 commit into
mlcommons:mainfrom
CentML:feat/maxhu-add-profiler-trigger
Open

feat: add client-side profiling trigger#326
maxyanghu wants to merge 1 commit into
mlcommons:mainfrom
CentML:feat/maxhu-add-profiler-trigger

Conversation

@maxyanghu
Copy link
Copy Markdown
Collaborator

Adds an optional client-side trigger that fires POST /start_profile at the performance phase start and /stop_profile at run end, so a profiled run can be driven from a YAML/CLI flag without coupling endpoints to any vendor harness.

Schema: ProfilerEngine enum (currently {vllm}) and ProfilingConfig hung off Settings. URLs are auto-derived per entry in endpoint_config.endpoints (strip /v1, append engine-specific path). Default-off; warn-don't-fail throughout.

Report.txt gets a Profiling section and a sibling profiling.json is written next to result_summary.json when the trigger is enabled.

closes #324

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Adds an optional client-side trigger that fires POST /start_profile
at the performance phase start and /stop_profile at run end, so a
profiled run can be driven from a YAML/CLI flag without coupling
endpoints to any vendor harness.

Schema: ProfilerEngine enum (currently {vllm}) and ProfilingConfig
hung off Settings. URLs are auto-derived per entry in
endpoint_config.endpoints (strip /v1, append engine-specific path).
Default-off; warn-don't-fail throughout.

Report.txt gets a Profiling section and a sibling profiling.json is
written next to result_summary.json when the trigger is enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxyanghu maxyanghu requested a review from a team May 26, 2026 19:47
@github-actions
Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces client-side profiling triggers for inference engines (specifically supporting vLLM) around the performance phase of a benchmark run. It adds a --profile CLI option, config schema updates, and logic to trigger start/stop profiling endpoints. The review feedback highlights a critical issue where the profiling stop requests are synchronous blocking network calls executed within the main event loop during the cleanup phase. This can block the event loop or prevent subsequent cleanup steps if interrupted, and should be run asynchronously using an executor.

Comment on lines +913 to +928
if profile_starts:
stop_reason = "phase_end" if session_completed_normally else "abort"
for i, start_rec in enumerate(profile_starts):
if start_rec["status"] != 200 or i >= len(profile_stop_urls):
continue
rec = _post_profile(profile_stop_urls[i])
rec["stop_reason"] = stop_reason
if rec["status"] == 200:
logger.info("Profile stop: %s -> 200 OK", profile_stop_urls[i])
else:
logger.warning(
"Profile stop: %s -> %s",
profile_stop_urls[i],
rec["error"] or rec["status"],
)
profile_stops.append(rec)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The profiling stop requests are synchronous blocking network calls (_post_profile uses urllib.request.urlopen with a 2-second timeout). Running these sequentially in the main event loop thread inside the finally block can block the event loop for several seconds if the endpoints are slow or unresponsive.

Furthermore, if the user interrupts the cleanup process (e.g., by pressing Ctrl+C a second time because it appears hung), a KeyboardInterrupt will be raised during these blocking calls, which will bypass the rest of the critical cleanup (such as shutting down the HTTP client, closing the publisher, and terminating launcher services).

To prevent this, we should run the profiling stop requests asynchronously in a thread pool using loop.run_in_executor and wrap the entire block in a try...except BaseException to ensure that any network failures or interrupts during profiling stop do not prevent the rest of the cleanup from executing.

            if profile_starts:
                try:
                    stop_reason = "phase_end" if session_completed_normally else "abort"
                    for i, start_rec in enumerate(profile_starts):
                        if start_rec["status"] != 200 or i >= len(profile_stop_urls):
                            continue
                        try:
                            rec = await loop.run_in_executor(
                                None, _post_profile, profile_stop_urls[i]
                            )
                            rec["stop_reason"] = stop_reason
                            if rec["status"] == 200:
                                logger.info("Profile stop: %s -> 200 OK", profile_stop_urls[i])
                            else:
                                logger.warning(
                                    "Profile stop: %s -> %s",
                                    profile_stop_urls[i],
                                    rec["error"] or rec["status"],
                                )
                            profile_stops.append(rec)
                        except Exception as e:
                            logger.warning(
                                "Failed to stop profile for %s: %s",
                                profile_stop_urls[i],
                                e,
                            )
                except BaseException as e:
                    logger.warning("Profiling stop cleanup was interrupted: %s", e)

@wangshangsam wangshangsam requested a review from arekay-nv May 27, 2026 05:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: support /profile_start

1 participant