Fix attribution backend timeout handling by namitdhameja · Pull Request #347 · NVIDIA/nvidia-resiliency-ext

namitdhameja · 2026-06-10T16:07:44Z

Fix lib backend timeout handling by running blocking LogSage work off the event loop.
Keep FT launcher-managed attribution on the supported MCP backend but not lib.
Leave standalone nvrx-attrsvc able to use mcp or lib.

greptile-apps · 2026-06-10T16:13:45Z

Greptile Summary

This PR fixes lib-backend timeout handling in the attribution service by offloading blocking asyncio.run(analyzer.run(...)) to a thread pool via asyncio.to_thread, and restricts launcher-managed attribution to the mcp backend only. The asyncio.shield wrapper lets the coalescer cancel and return a timeout response while the worker thread finishes, and the serialization lock is released by a done callback rather than by the caller to prevent concurrent runs on the singleton analyzer.

log_analyzer.py: New _run_lib_log_analyzer_serialized method offloads blocking lib/LLM work to a thread, uses asyncio.shield so the coalescer can time out without cancelling the underlying worker, and logs a warning if the worker fails after the caller has already timed out.
attribution_manager.py / cli_args.py: _normalize_analysis_backend validates that the launcher-side backend can only be mcp; lib now raises ValueError at construction time, and the CLI choices tuple is narrowed to ("mcp",) to match.
Tests: Two new observability tests cover the coalescer-timeout path and the post-timeout worker-failure warning path.

Confidence Score: 5/5

Safe to merge — the async offloading and lock lifecycle are correct, and the restriction of launcher-managed attribution to the mcp backend is consistently enforced at both the CLI and dataclass construction layers.

The asyncio.shield + asyncio.to_thread pattern correctly gives the coalescer a real await point to cancel on while keeping the singleton analyzer serialized. The done callback is registered before await, the lock is always released by the callback (not the caller), and waiter_cancelled is read only after all cancellation handling has run on the single event loop thread. No logic bugs or data-loss paths were found.

No files require special attention.

Important Files Changed

Filename	Overview
src/nvidia_resiliency_ext/attribution/orchestration/log_analyzer.py	Core fix: adds _run_lib_log_analyzer_serialized using asyncio.to_thread + asyncio.shield to let the coalescer time out without blocking on synchronous LLM work; lock/callback lifecycle is correct and thoroughly tested.
src/nvidia_resiliency_ext/fault_tolerance/attribution_manager.py	Adds post_init validation via _normalize_analysis_backend that rejects 'lib' for launcher-managed attribution; frozen-dataclass pattern with object.setattr is used correctly.
src/nvidia_resiliency_ext/fault_tolerance/cli_args.py	choices narrowed to ("mcp",) with metavar override and type=str.lower for case-insensitive matching; consistent with the new validation in attribution_manager.py.
tests/attribution/unit/test_log_analyzer_observability.py	Two new tests verify coalescer-timeout responsiveness and post-timeout worker-failure logging; broad stub additions unlock the in-process LogSage path for unit testing.
tests/fault_tolerance/unit/test_attribution_manager.py	Updates test fixture to use 'mcp' and adds test_attribution_config_rejects_lib_analysis_backend to cover the new ValueError path.
docs/source/fault_tolerance/usage_guide.rst	Documents the new --ft-attribution-analysis-backend flag and clarifies that launcher-managed attribution only supports the mcp backend.

Sequence Diagram

sequenceDiagram
    participant C as Coalescer
    participant F as _fetch_log_result_lib
    participant S as _run_lib_log_analyzer_serialized
    participant T as asyncio.to_thread (worker task)
    participant Th as Thread (asyncio.run)

    C->>F: "get_or_compute(..., compute_timeout=N)"
    F->>S: await _run_lib_log_analyzer_serialized(analyzer, kwargs)
    S->>S: await _log_analysis_lock.acquire()
    S->>T: asyncio.create_task(asyncio.to_thread(...))
    T->>Th: spawn thread → asyncio.run(analyzer.run(...))
    S->>T: add_done_callback(_finish_lib_run)
    S->>T: await asyncio.shield(worker_task)

    alt Normal completion
        Th-->>T: result
        T->>S: _finish_lib_run: release lock, discard task
        T-->>S: shield resolves → return result
        S-->>F: result
        F-->>C: log_result dict
    else Coalescer timeout (CancelledError)
        C->>S: cancel outer task (via wait_for)
        S->>S: "CancelledError → waiter_cancelled=True, re-raise"
        S-->>C: CancelledError propagates
        C-->>C: "return {state: timeout}"
        Note over T,Th: worker_task continues running in thread
        Th-->>T: completes (success or error)
        T->>T: _finish_lib_run: release lock, log warning if error
    end

_{Reviews (3): Last reviewed commit: "Fix attribution backend timeout handling" | Re-trigger Greptile}

greptile-apps · 2026-06-10T16:13:49Z

+        def _finish_lib_run(task: "asyncio.Task[Any]") -> None:
+            self._lib_log_analysis_tasks.discard(task)
+            if self._log_analysis_lock.locked():
+                self._log_analysis_lock.release()
+            try:
+                task.exception()
+            except asyncio.CancelledError:
+                pass


Worker exceptions silently dropped after caller timeout

When the coalescer times out and cancels _fetch_log_result_lib, asyncio.shield absorbs the cancellation and the worker task keeps running. If that worker later fails with an exception, task.exception() returns the exception object — but the callback neither logs nor re-raises it, so the failure is silently dropped. The caller already received CancelledError, so there is no other path that will surface this error. A logger.warning (or logger.debug) for the non-None return value of task.exception() would make post-timeout failures observable in service logs.

namitdhameja self-assigned this Jun 10, 2026

greptile-apps Bot reviewed Jun 10, 2026

View reviewed changes

namitdhameja force-pushed the nvbug6287896 branch 2 times, most recently from 1134c08 to 53b6565 Compare June 10, 2026 16:30

namitdhameja added the ci-approved Approved to run CI label Jun 10, 2026

namitdhameja requested review from helisha91, hexinw-nvidia and sbak5 June 10, 2026 16:46

Fix attribution backend timeout handling

90947cd

namitdhameja force-pushed the nvbug6287896 branch from 53b6565 to 90947cd Compare June 10, 2026 17:02

hexinw-nvidia approved these changes Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix attribution backend timeout handling#347

Fix attribution backend timeout handling#347
namitdhameja wants to merge 1 commit into
mainfrom
nvbug6287896

namitdhameja commented Jun 10, 2026

Uh oh!

greptile-apps Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

namitdhameja commented Jun 10, 2026

Uh oh!

greptile-apps Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented Jun 10, 2026 •

edited

Loading