Skip to content

[Bug] Tune driver RSS grows linearly with trial count due to orphaned logger state#64335

Open
goingforstudying-ctrl wants to merge 4 commits into
ray-project:masterfrom
goingforstudying-ctrl:fix/tune-memory-leak-64231
Open

[Bug] Tune driver RSS grows linearly with trial count due to orphaned logger state#64335
goingforstudying-ctrl wants to merge 4 commits into
ray-project:masterfrom
goingforstudying-ctrl:fix/tune-memory-leak-64231

Conversation

@goingforstudying-ctrl

Copy link
Copy Markdown

Was running a large hyperparam sweep with tune.run() and noticed the driver process RSS just kept climbing. After a few thousand trials it was eating multiple GBs and eventually got killed by the OOM killer.

I dug into the per-trial accumulators and found three separate leaks layered on top of each other:

  1. CSVLoggerCallback keeps a _trial_continue dict to know whether to write a CSV header. The entry is never removed when the trial ends, so it grows linearly with the number of trials. Same issue in JsonLoggerCallback with _trial_configs.

  2. TuneController queues scheduler decisions in _queued_trial_decisions, but if a trial stops or fails outside the normal result path (e.g. via stop_trial() or an exception), the queued entry never gets popped by _maybe_execute_queued_decision. It just sits there forever.

  3. Trial.get_json_state() serializes trial state with indent=2 for pretty printing. This is the largest per-trial blob — it's cached in memory and rewritten to the experiment-state checkpoint file every period. Dropping the indentation saves about 25-30% on both the retained footprint and the on-disk experiment state. Since this JSON is machine-read anyway, the pretty formatting is just waste.

The first two are straightforward — just clean up the dict entries in log_trial_end() and _schedule_trial_stop(). The third is a one-line change to remove indent=2 from json.dumps.

I also added a test file (test_logger_memory_leak.py) that spins up a bunch of mock trials, runs them through the loggers, and asserts the internal dicts are empty afterwards. Caught all three issues during development.

Not 100% sure if removing indent=2 breaks any downstream tools that parse the experiment state checkpoint as plain text. From what I can tell it's always loaded back via json.loads, so it should be fine. Let me know if there's a tool I'm missing.

This is the same thing reported in #64231.

@goingforstudying-ctrl goingforstudying-ctrl requested a review from a team as a code owner June 25, 2026 06:30

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses memory leaks in Ray Tune when running thousands of trials by cleaning up trial-related state in the controller and logger callbacks upon trial completion, and by using compact JSON serialization for trial state checkpoints. The feedback suggests extending the controller cleanup to include _cached_trial_decisions when a trial is not saving, and further optimizing the JSON serialization by using compact separators to minimize memory footprint.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1372 to +1374
# Clean up any queued decision for this trial to prevent memory leak
# when trials stop or fail outside the normal result path (see #64231)
self._queued_trial_decisions.pop(trial.trial_id, None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.

Suggested change
# Clean up any queued decision for this trial to prevent memory leak
# when trials stop or fail outside the normal result path (see #64231)
self._queued_trial_decisions.pop(trial.trial_id, None)
# Clean up any queued or cached decision for this trial to prevent memory leak
# when trials stop or fail outside the normal result path (see #64231)
self._queued_trial_decisions.pop(trial.trial_id, None)
if not trial.is_saving:
self._cached_trial_decisions.pop(trial.trial_id, None)

@pseudo-rnd-thoughts pseudo-rnd-thoughts Jun 26, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@goingforstudying-ctrl Is gemini right for

In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.

Comment thread python/ray/tune/experiment/trial.py
@ray-gardener ray-gardener Bot added tune Tune-related issues community-contribution Contributed by the community labels Jun 25, 2026
@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch 8 times, most recently from a39fe09 to 0694bac Compare June 25, 2026 19:43

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 0694bac. Configure here.

Comment thread python/ray/tune/tests/test_logger_memory_leak.py Outdated
@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch 5 times, most recently from 88231ec to f4bebe5 Compare June 26, 2026 06:21

@pseudo-rnd-thoughts pseudo-rnd-thoughts left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, it overall looks good. Do you have a minimal example script to reproduce the original issue? Do you have a graph of the memory usage over time with the Ray 2.55 and this PR?

Please reduce the length of comments that are LLM generated. For the test added, how long does it take to run?

Comment on lines +1372 to +1374
# Clean up any queued decision for this trial to prevent memory leak
# when trials stop or fail outside the normal result path (see #64231)
self._queued_trial_decisions.pop(trial.trial_id, None)

@pseudo-rnd-thoughts pseudo-rnd-thoughts Jun 26, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@goingforstudying-ctrl Is gemini right for

In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.

@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch 7 times, most recently from ab5f3f3 to ab91b22 Compare June 27, 2026 01:35
@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch 2 times, most recently from a6c77e4 to b5468dd Compare June 27, 2026 17:13
- CSVLoggerCallback now cleans up _trial_continue on log_trial_end
- JsonLoggerCallback now cleans up _trial_configs on log_trial_end
- TuneController._schedule_trial_stop now removes orphaned queued decisions
- Trial.get_json_state uses compact JSON to reduce checkpoint memory ~25-30%
- Adds test_logger_memory_leak.py to verify cleanup behavior

see ray-project#64231

Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
The stub init_local_path did nothing, but tests log under nested paths
like test_dir/trial_0. Real trials create those directories before
opening progress.csv or param files. Have the mock create its logdir
so FileNotFoundError does not surface during log_trial_result.

Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
Also remove cached decisions in _schedule_trial_stop to prevent
memory leaks when trials stop or fail outside the normal save path.

Addresses review feedback about orphaned _cached_trial_decisions
entries when a trial is stopped or errors while not saving.

Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch from b5468dd to be756ba Compare June 27, 2026 21:01
This script demonstrates the three memory leaks fixed in this PR:
1. CSVLoggerCallback._trial_continue dict grows without cleanup
2. JsonLoggerCallback._trial_configs dict grows without cleanup
3. TuneController._queued_trial_decisions dict grows when trials stop
   outside the normal result path

Run with: python python/ray/tune/tests/repro_memory_leak.py

Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/tune-memory-leak-64231 branch from ca6c655 to af8c1e3 Compare June 28, 2026 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community tune Tune-related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants