Skip to content

Fix attribution backend timeout handling#347

Open
namitdhameja wants to merge 1 commit into
mainfrom
nvbug6287896
Open

Fix attribution backend timeout handling#347
namitdhameja wants to merge 1 commit into
mainfrom
nvbug6287896

Conversation

@namitdhameja

Copy link
Copy Markdown
Contributor
  • Fix lib backend timeout handling by running blocking LogSage work off the event loop.
  • Keep FT launcher-managed attribution on the supported MCP backend but not lib.
  • Leave standalone nvrx-attrsvc able to use mcp or lib.

@namitdhameja namitdhameja self-assigned this Jun 10, 2026
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes lib-backend timeout handling in the attribution service by offloading blocking asyncio.run(analyzer.run(...)) to a thread pool via asyncio.to_thread, and restricts launcher-managed attribution to the mcp backend only. The asyncio.shield wrapper lets the coalescer cancel and return a timeout response while the worker thread finishes, and the serialization lock is released by a done callback rather than by the caller to prevent concurrent runs on the singleton analyzer.

  • log_analyzer.py: New _run_lib_log_analyzer_serialized method offloads blocking lib/LLM work to a thread, uses asyncio.shield so the coalescer can time out without cancelling the underlying worker, and logs a warning if the worker fails after the caller has already timed out.
  • attribution_manager.py / cli_args.py: _normalize_analysis_backend validates that the launcher-side backend can only be mcp; lib now raises ValueError at construction time, and the CLI choices tuple is narrowed to ("mcp",) to match.
  • Tests: Two new observability tests cover the coalescer-timeout path and the post-timeout worker-failure warning path.

Confidence Score: 5/5

Safe to merge — the async offloading and lock lifecycle are correct, and the restriction of launcher-managed attribution to the mcp backend is consistently enforced at both the CLI and dataclass construction layers.

The asyncio.shield + asyncio.to_thread pattern correctly gives the coalescer a real await point to cancel on while keeping the singleton analyzer serialized. The done callback is registered before await, the lock is always released by the callback (not the caller), and waiter_cancelled is read only after all cancellation handling has run on the single event loop thread. No logic bugs or data-loss paths were found.

No files require special attention.

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/attribution/orchestration/log_analyzer.py Core fix: adds _run_lib_log_analyzer_serialized using asyncio.to_thread + asyncio.shield to let the coalescer time out without blocking on synchronous LLM work; lock/callback lifecycle is correct and thoroughly tested.
src/nvidia_resiliency_ext/fault_tolerance/attribution_manager.py Adds post_init validation via _normalize_analysis_backend that rejects 'lib' for launcher-managed attribution; frozen-dataclass pattern with object.setattr is used correctly.
src/nvidia_resiliency_ext/fault_tolerance/cli_args.py choices narrowed to ("mcp",) with metavar override and type=str.lower for case-insensitive matching; consistent with the new validation in attribution_manager.py.
tests/attribution/unit/test_log_analyzer_observability.py Two new tests verify coalescer-timeout responsiveness and post-timeout worker-failure logging; broad stub additions unlock the in-process LogSage path for unit testing.
tests/fault_tolerance/unit/test_attribution_manager.py Updates test fixture to use 'mcp' and adds test_attribution_config_rejects_lib_analysis_backend to cover the new ValueError path.
docs/source/fault_tolerance/usage_guide.rst Documents the new --ft-attribution-analysis-backend flag and clarifies that launcher-managed attribution only supports the mcp backend.

Sequence Diagram

sequenceDiagram
    participant C as Coalescer
    participant F as _fetch_log_result_lib
    participant S as _run_lib_log_analyzer_serialized
    participant T as asyncio.to_thread (worker task)
    participant Th as Thread (asyncio.run)

    C->>F: "get_or_compute(..., compute_timeout=N)"
    F->>S: await _run_lib_log_analyzer_serialized(analyzer, kwargs)
    S->>S: await _log_analysis_lock.acquire()
    S->>T: asyncio.create_task(asyncio.to_thread(...))
    T->>Th: spawn thread → asyncio.run(analyzer.run(...))
    S->>T: add_done_callback(_finish_lib_run)
    S->>T: await asyncio.shield(worker_task)

    alt Normal completion
        Th-->>T: result
        T->>S: _finish_lib_run: release lock, discard task
        T-->>S: shield resolves → return result
        S-->>F: result
        F-->>C: log_result dict
    else Coalescer timeout (CancelledError)
        C->>S: cancel outer task (via wait_for)
        S->>S: "CancelledError → waiter_cancelled=True, re-raise"
        S-->>C: CancelledError propagates
        C-->>C: "return {state: timeout}"
        Note over T,Th: worker_task continues running in thread
        Th-->>T: completes (success or error)
        T->>T: _finish_lib_run: release lock, log warning if error
    end
Loading

Reviews (3): Last reviewed commit: "Fix attribution backend timeout handling" | Re-trigger Greptile

Comment on lines +157 to +164
def _finish_lib_run(task: "asyncio.Task[Any]") -> None:
self._lib_log_analysis_tasks.discard(task)
if self._log_analysis_lock.locked():
self._log_analysis_lock.release()
try:
task.exception()
except asyncio.CancelledError:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Worker exceptions silently dropped after caller timeout

When the coalescer times out and cancels _fetch_log_result_lib, asyncio.shield absorbs the cancellation and the worker task keeps running. If that worker later fails with an exception, task.exception() returns the exception object — but the callback neither logs nor re-raises it, so the failure is silently dropped. The caller already received CancelledError, so there is no other path that will surface this error. A logger.warning (or logger.debug) for the non-None return value of task.exception() would make post-timeout failures observable in service logs.

@namitdhameja namitdhameja force-pushed the nvbug6287896 branch 2 times, most recently from 1134c08 to 53b6565 Compare June 10, 2026 16:30
@namitdhameja namitdhameja added the ci-approved Approved to run CI label Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved Approved to run CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants