[Bug] Tune driver RSS grows linearly with trial count due to orphaned logger state by goingforstudying-ctrl · Pull Request #64335 · ray-project/ray

goingforstudying-ctrl · 2026-06-25T06:30:21Z

Was running a large hyperparam sweep with tune.run() and noticed the driver process RSS just kept climbing. After a few thousand trials it was eating multiple GBs and eventually got killed by the OOM killer.

I dug into the per-trial accumulators and found three separate leaks layered on top of each other:

CSVLoggerCallback keeps a _trial_continue dict to know whether to write a CSV header. The entry is never removed when the trial ends, so it grows linearly with the number of trials. Same issue in JsonLoggerCallback with _trial_configs.
TuneController queues scheduler decisions in _queued_trial_decisions, but if a trial stops or fails outside the normal result path (e.g. via stop_trial() or an exception), the queued entry never gets popped by _maybe_execute_queued_decision. It just sits there forever.
Trial.get_json_state() serializes trial state with indent=2 for pretty printing. This is the largest per-trial blob — it's cached in memory and rewritten to the experiment-state checkpoint file every period. Dropping the indentation saves about 25-30% on both the retained footprint and the on-disk experiment state. Since this JSON is machine-read anyway, the pretty formatting is just waste.

The first two are straightforward — just clean up the dict entries in log_trial_end() and _schedule_trial_stop(). The third is a one-line change to remove indent=2 from json.dumps.

I also added a test file (test_logger_memory_leak.py) that spins up a bunch of mock trials, runs them through the loggers, and asserts the internal dicts are empty afterwards. Caught all three issues during development.

Not 100% sure if removing indent=2 breaks any downstream tools that parse the experiment state checkpoint as plain text. From what I can tell it's always loaded back via json.loads, so it should be fine. Let me know if there's a tool I'm missing.

This is the same thing reported in #64231.

gemini-code-assist

Code Review

This pull request addresses memory leaks in Ray Tune when running thousands of trials by cleaning up trial-related state in the controller and logger callbacks upon trial completion, and by using compact JSON serialization for trial state checkpoints. The feedback suggests extending the controller cleanup to include _cached_trial_decisions when a trial is not saving, and further optimizing the JSON serialization by using compact separators to minimize memory footprint.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-25T06:31:48Z

+        # Clean up any queued decision for this trial to prevent memory leak
+        # when trials stop or fail outside the normal result path (see #64231)
+        self._queued_trial_decisions.pop(trial.trial_id, None)


In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.

Suggested change

# Clean up any queued decision for this trial to prevent memory leak

# when trials stop or fail outside the normal result path (see #64231)

self._queued_trial_decisions.pop(trial.trial_id, None)

# Clean up any queued or cached decision for this trial to prevent memory leak

# when trials stop or fail outside the normal result path (see #64231)

self._queued_trial_decisions.pop(trial.trial_id, None)

if not trial.is_saving:

self._cached_trial_decisions.pop(trial.trial_id, None)

@goingforstudying-ctrl Is gemini right for

In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 0694bac. Configure here.}

pseudo-rnd-thoughts

Thanks for the changes, it overall looks good. Do you have a minimal example script to reproduce the original issue? Do you have a graph of the memory usage over time with the Ray 2.55 and this PR?

Please reduce the length of comments that are LLM generated. For the test added, how long does it take to run?

pseudo-rnd-thoughts · 2026-06-26T08:39:57Z

+        # Clean up any queued decision for this trial to prevent memory leak
+        # when trials stop or fail outside the normal result path (see #64231)
+        self._queued_trial_decisions.pop(trial.trial_id, None)


@goingforstudying-ctrl Is gemini right for

In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.

- CSVLoggerCallback now cleans up _trial_continue on log_trial_end - JsonLoggerCallback now cleans up _trial_configs on log_trial_end - TuneController._schedule_trial_stop now removes orphaned queued decisions - Trial.get_json_state uses compact JSON to reduce checkpoint memory ~25-30% - Adds test_logger_memory_leak.py to verify cleanup behavior see ray-project#64231 Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>

The stub init_local_path did nothing, but tests log under nested paths like test_dir/trial_0. Real trials create those directories before opening progress.csv or param files. Have the mock create its logdir so FileNotFoundError does not surface during log_trial_result. Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>

Also remove cached decisions in _schedule_trial_stop to prevent memory leaks when trials stop or fail outside the normal save path. Addresses review feedback about orphaned _cached_trial_decisions entries when a trial is stopped or errors while not saving. Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>

This script demonstrates the three memory leaks fixed in this PR: 1. CSVLoggerCallback._trial_continue dict grows without cleanup 2. JsonLoggerCallback._trial_configs dict grows without cleanup 3. TuneController._queued_trial_decisions dict grows when trials stop outside the normal result path Run with: python python/ray/tune/tests/repro_memory_leak.py Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>

goingforstudying-ctrl requested a review from a team as a code owner June 25, 2026 06:30

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

ray-gardener Bot added tune Tune-related issues community-contribution Contributed by the community labels Jun 25, 2026

goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch 8 times, most recently from a39fe09 to 0694bac Compare June 25, 2026 19:43

cursor Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread python/ray/tune/tests/test_logger_memory_leak.py Outdated

goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch 5 times, most recently from 88231ec to f4bebe5 Compare June 26, 2026 06:21

pseudo-rnd-thoughts reviewed Jun 26, 2026

View reviewed changes

goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch 7 times, most recently from ab5f3f3 to ab91b22 Compare June 27, 2026 01:35

zhangsikai123 mentioned this pull request Jun 27, 2026

[Tune] Ray Tune memory leak: RSS grows linearly with trial count in tune.run() #64231

Open

goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch 2 times, most recently from a6c77e4 to b5468dd Compare June 27, 2026 17:13

goingforstudying-ctrl added 2 commits June 27, 2026 17:01

goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch from b5468dd to be756ba Compare June 27, 2026 21:01

goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch from ca6c655 to af8c1e3 Compare June 28, 2026 18:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug] Tune driver RSS grows linearly with trial count due to orphaned logger state#64335

[Bug] Tune driver RSS grows linearly with trial count due to orphaned logger state#64335
goingforstudying-ctrl wants to merge 4 commits into
ray-project:masterfrom
goingforstudying-ctrl:fix/tune-memory-leak-64231

goingforstudying-ctrl commented Jun 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Uh oh!

pseudo-rnd-thoughts Jun 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

pseudo-rnd-thoughts left a comment

Uh oh!

pseudo-rnd-thoughts Jun 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

goingforstudying-ctrl commented Jun 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

pseudo-rnd-thoughts Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pseudo-rnd-thoughts left a comment

Choose a reason for hiding this comment

Uh oh!

pseudo-rnd-thoughts Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pseudo-rnd-thoughts Jun 26, 2026 •

edited

Loading

pseudo-rnd-thoughts Jun 26, 2026 •

edited

Loading