Skip to content

[core] fix: replace per-run log sinks with single dispatch sink to fix leaks and HDFS stalls#63

Merged
yyDing1 merged 2 commits into
mainfrom
yy/fix-async-logging
Jun 15, 2026
Merged

[core] fix: replace per-run log sinks with single dispatch sink to fix leaks and HDFS stalls#63
yyDing1 merged 2 commits into
mainfrom
yy/fix-async-logging

Conversation

@yyDing1

@yyDing1 yyDing1 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Rework the per-run logging backend so it survives large-scale concurrent
training. Replaces the "one loguru sink per run" design with a single global
dispatch sink, fixes a sink/thread/fd leak, and stops per-record flushes from
stalling the writer thread when log_dir lives on HDFS/FUSE.

Problem

The previous implementation called logger.add(<file>) once per run:

  • O(num_runs) per message: every record fanned out across all live sinks,
    each re-evaluating a per-run case_filter. Cost grew with the number of
    concurrent runs.
  • Leak: each run added a sink + background thread + open file descriptor.
    If cleanup_handlers wasn't called (and it wasn't, on the training path),
    these accumulated until logging fell over / fds were exhausted.
  • HDFS stalls: the sink flushed every record. On an HDFS/FUSE-backed
    log_dir, each flush is a synchronous remote write(); with a single shared
    writer thread, one slow flush blocked logging for all runs.
  • DEBUG_MODE parsing treated "0"/"false"/"" as enabled (non-empty string
    is truthy).

What changed

uni_agent/async_logging.py

  • Single dispatch sink: one logger.add(_dispatch, enqueue=True) routes each
    record to its run's file in O(1) by run_id, backed by one background writer
    thread.
  • Real cleanup: cleanup_handlers(run_id) closes the run's file object and
    drops it from the registry (no more leaked sinks/threads/fds).
  • Block buffering, no per-line flush by default: lines accumulate in-process
    and reach the OS ~every io.DEFAULT_BUFFER_SIZE or on close(), drastically
    cutting HDFS/FUSE write() frequency. Set LOG_FLUSH_EACH_LINE=1 for
    near-real-time tailing on local disks.
  • Narrowed lock scope: the lock only guards the registry lookup; file I/O
    runs outside it, so a slow write can't block run registration/cleanup. The
    write-after-close race is handled by catching ValueError/OSError.
  • Fixed DEBUG_MODE parsing via explicit truthy-value parsing.
    uni_agent/agent_loop.py
  • Call cleanup_handlers(self.run_id) in UniAgentLoop.run()'s finally block
    so the per-run file is closed on the training path (success and failure).

yyDing1 and others added 2 commits June 15, 2026 17:26
…urrency

Replace the "one loguru sink per run" design with a single global dispatch
sink that routes each record to its run's file in O(1) by run_id, backed by
one background writer thread. Previously every record fanned out across all
live sinks (O(num_runs) filtering per message) and each run leaked a sink +
thread + file descriptor when not cleaned up, which is the main reason logging
fell over under large-scale concurrent training.

Also call cleanup_handlers(run_id) in UniAgentLoop.run()'s finally block so the
per-run file is closed on the training path, and fix DEBUG_MODE parsing so
values like "0"/"false" are no longer treated as enabled.

Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the logging system to use a single global Loguru sink with asynchronous queueing and custom dispatching to run-specific files, replacing the previous dynamic handler registry. In the agent loop, a cleanup call is added to close these files when a run finishes. The review feedback identifies a critical issue where closing the file descriptor immediately in cleanup_handlers can cause pending asynchronous log messages to be lost. To resolve this, it is recommended to make cleanup_handlers asynchronous and await logger.complete() before closing the file, and consequently await cleanup_handlers in the agent loop's finally block.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +84 to +91
def cleanup_handlers(run_id: str) -> None:
with _lock:
entry = _run_files.pop(run_id, None)
if entry is not None:
try:
entry[0].close()
except OSError:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since the global loguru sink is configured with enqueue=True, log messages are processed asynchronously in a background thread. If cleanup_handlers immediately pops and closes the file object, any pending log messages still in the queue for this run_id will be silently discarded because the file will either be closed or removed from _run_files before the background thread can process them.

To prevent losing critical trailing logs (such as final metrics, status, or error details), cleanup_handlers should be made asynchronous and await logger.complete() to ensure all enqueued log messages are fully flushed and written before the file is closed.

Suggested change
def cleanup_handlers(run_id: str) -> None:
with _lock:
entry = _run_files.pop(run_id, None)
if entry is not None:
try:
entry[0].close()
except OSError:
pass
async def cleanup_handlers(run_id: str) -> None:
await logger.complete()
with _lock:
entry = _run_files.pop(run_id, None)
if entry is not None:
try:
entry[0].close()
except OSError:
pass

Comment thread uni_agent/agent_loop.py
output = await self._build_empty_agent_output(exit_reason="agent_loop_failed")
finally:
await self.env.close()
cleanup_handlers(self.run_id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since cleanup_handlers needs to be asynchronous to await the flushing of enqueued log messages, we must await its call here in the finally block.

Suggested change
cleanup_handlers(self.run_id)
await cleanup_handlers(self.run_id)

@yyDing1 yyDing1 merged commit 0754fac into main Jun 15, 2026
3 checks passed
@yyDing1 yyDing1 deleted the yy/fix-async-logging branch June 15, 2026 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant