[core] fix: replace per-run log sinks with single dispatch sink to fix leaks and HDFS stalls by yyDing1 · Pull Request #63 · verl-project/uni-agent

yyDing1 · 2026-06-15T10:08:08Z

Summary

Rework the per-run logging backend so it survives large-scale concurrent
training. Replaces the "one loguru sink per run" design with a single global
dispatch sink, fixes a sink/thread/fd leak, and stops per-record flushes from
stalling the writer thread when log_dir lives on HDFS/FUSE.

Problem

The previous implementation called logger.add(<file>) once per run:

O(num_runs) per message: every record fanned out across all live sinks,
each re-evaluating a per-run case_filter. Cost grew with the number of
concurrent runs.
Leak: each run added a sink + background thread + open file descriptor.
If cleanup_handlers wasn't called (and it wasn't, on the training path),
these accumulated until logging fell over / fds were exhausted.
HDFS stalls: the sink flushed every record. On an HDFS/FUSE-backed
log_dir, each flush is a synchronous remote write(); with a single shared
writer thread, one slow flush blocked logging for all runs.
DEBUG_MODE parsing treated "0"/"false"/"" as enabled (non-empty string
is truthy).

What changed

uni_agent/async_logging.py

Single dispatch sink: one logger.add(_dispatch, enqueue=True) routes each
record to its run's file in O(1) by run_id, backed by one background writer
thread.
Real cleanup: cleanup_handlers(run_id) closes the run's file object and
drops it from the registry (no more leaked sinks/threads/fds).
Block buffering, no per-line flush by default: lines accumulate in-process
and reach the OS ~every io.DEFAULT_BUFFER_SIZE or on close(), drastically
cutting HDFS/FUSE write() frequency. Set LOG_FLUSH_EACH_LINE=1 for
near-real-time tailing on local disks.
Narrowed lock scope: the lock only guards the registry lookup; file I/O
runs outside it, so a slow write can't block run registration/cleanup. The
write-after-close race is handled by catching ValueError/OSError.
Fixed DEBUG_MODE parsing via explicit truthy-value parsing.
uni_agent/agent_loop.py
Call cleanup_handlers(self.run_id) in UniAgentLoop.run()'s finally block
so the per-run file is closed on the training path (success and failure).

…urrency Replace the "one loguru sink per run" design with a single global dispatch sink that routes each record to its run's file in O(1) by run_id, backed by one background writer thread. Previously every record fanned out across all live sinks (O(num_runs) filtering per message) and each run leaked a sink + thread + file descriptor when not cleaned up, which is the main reason logging fell over under large-scale concurrent training. Also call cleanup_handlers(run_id) in UniAgentLoop.run()'s finally block so the per-run file is closed on the training path, and fix DEBUG_MODE parsing so values like "0"/"false" are no longer treated as enabled. Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist

Code Review

This pull request refactors the logging system to use a single global Loguru sink with asynchronous queueing and custom dispatching to run-specific files, replacing the previous dynamic handler registry. In the agent loop, a cleanup call is added to close these files when a run finishes. The review feedback identifies a critical issue where closing the file descriptor immediately in cleanup_handlers can cause pending asynchronous log messages to be lost. To resolve this, it is recommended to make cleanup_handlers asynchronous and await logger.complete() before closing the file, and consequently await cleanup_handlers in the agent loop's finally block.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-15T10:09:28Z

+def cleanup_handlers(run_id: str) -> None:
+    with _lock:
+        entry = _run_files.pop(run_id, None)
+    if entry is not None:
+        try:
+            entry[0].close()
+        except OSError:
+            pass


Since the global loguru sink is configured with enqueue=True, log messages are processed asynchronously in a background thread. If cleanup_handlers immediately pops and closes the file object, any pending log messages still in the queue for this run_id will be silently discarded because the file will either be closed or removed from _run_files before the background thread can process them.

To prevent losing critical trailing logs (such as final metrics, status, or error details), cleanup_handlers should be made asynchronous and await logger.complete() to ensure all enqueued log messages are fully flushed and written before the file is closed.

Suggested change

def cleanup_handlers(run_id: str) -> None:

with _lock:

entry = _run_files.pop(run_id, None)

if entry is not None:

try:

entry[0].close()

except OSError:

pass

async def cleanup_handlers(run_id: str) -> None:

await logger.complete()

with _lock:

entry = _run_files.pop(run_id, None)

if entry is not None:

try:

entry[0].close()

except OSError:

pass

gemini-code-assist · 2026-06-15T10:09:28Z

                output = await self._build_empty_agent_output(exit_reason="agent_loop_failed")
            finally:
                await self.env.close()
+                cleanup_handlers(self.run_id)


Since cleanup_handlers needs to be asynchronous to await the flushing of enqueued log messages, we must await its call here in the finally block.

Suggested change

cleanup_handlers(self.run_id)

await cleanup_handlers(self.run_id)

yyDing1 and others added 2 commits June 15, 2026 17:26

update

1e26ab8

gemini-code-assist Bot reviewed Jun 15, 2026

View reviewed changes

yyDing1 merged commit 0754fac into main Jun 15, 2026
3 checks passed

yyDing1 deleted the yy/fix-async-logging branch June 15, 2026 10:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[core] fix: replace per-run log sinks with single dispatch sink to fix leaks and HDFS stalls#63

[core] fix: replace per-run log sinks with single dispatch sink to fix leaks and HDFS stalls#63
yyDing1 merged 2 commits into
mainfrom
yy/fix-async-logging

yyDing1 commented Jun 15, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	cleanup_handlers(self.run_id)
	await cleanup_handlers(self.run_id)

Uh oh!

Conversation

yyDing1 commented Jun 15, 2026

Summary

Problem

What changed

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant