[Bug] Tune driver RSS grows linearly with trial count due to orphaned logger state#64335
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses memory leaks in Ray Tune when running thousands of trials by cleaning up trial-related state in the controller and logger callbacks upon trial completion, and by using compact JSON serialization for trial state checkpoints. The feedback suggests extending the controller cleanup to include _cached_trial_decisions when a trial is not saving, and further optimizing the JSON serialization by using compact separators to minimize memory footprint.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # Clean up any queued decision for this trial to prevent memory leak | ||
| # when trials stop or fail outside the normal result path (see #64231) | ||
| self._queued_trial_decisions.pop(trial.trial_id, None) |
There was a problem hiding this comment.
In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.
| # Clean up any queued decision for this trial to prevent memory leak | |
| # when trials stop or fail outside the normal result path (see #64231) | |
| self._queued_trial_decisions.pop(trial.trial_id, None) | |
| # Clean up any queued or cached decision for this trial to prevent memory leak | |
| # when trials stop or fail outside the normal result path (see #64231) | |
| self._queued_trial_decisions.pop(trial.trial_id, None) | |
| if not trial.is_saving: | |
| self._cached_trial_decisions.pop(trial.trial_id, None) |
There was a problem hiding this comment.
@goingforstudying-ctrl Is gemini right for
In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.
a39fe09 to
0694bac
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 0694bac. Configure here.
88231ec to
f4bebe5
Compare
pseudo-rnd-thoughts
left a comment
There was a problem hiding this comment.
Thanks for the changes, it overall looks good. Do you have a minimal example script to reproduce the original issue? Do you have a graph of the memory usage over time with the Ray 2.55 and this PR?
Please reduce the length of comments that are LLM generated. For the test added, how long does it take to run?
| # Clean up any queued decision for this trial to prevent memory leak | ||
| # when trials stop or fail outside the normal result path (see #64231) | ||
| self._queued_trial_decisions.pop(trial.trial_id, None) |
There was a problem hiding this comment.
@goingforstudying-ctrl Is gemini right for
In addition to cleaning up _queued_trial_decisions, we should also clean up _cached_trial_decisions when the trial is not saving. If a trial is stopped or fails outside the normal path while not saving, any entry in _cached_trial_decisions will be orphaned and leak memory, as the trial actor is terminated and no further results or saves will be processed to trigger its cleanup.
ab5f3f3 to
ab91b22
Compare
a6c77e4 to
b5468dd
Compare
- CSVLoggerCallback now cleans up _trial_continue on log_trial_end - JsonLoggerCallback now cleans up _trial_configs on log_trial_end - TuneController._schedule_trial_stop now removes orphaned queued decisions - Trial.get_json_state uses compact JSON to reduce checkpoint memory ~25-30% - Adds test_logger_memory_leak.py to verify cleanup behavior see ray-project#64231 Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
The stub init_local_path did nothing, but tests log under nested paths like test_dir/trial_0. Real trials create those directories before opening progress.csv or param files. Have the mock create its logdir so FileNotFoundError does not surface during log_trial_result. Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
Also remove cached decisions in _schedule_trial_stop to prevent memory leaks when trials stop or fail outside the normal save path. Addresses review feedback about orphaned _cached_trial_decisions entries when a trial is stopped or errors while not saving. Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
b5468dd to
be756ba
Compare
This script demonstrates the three memory leaks fixed in this PR: 1. CSVLoggerCallback._trial_continue dict grows without cleanup 2. JsonLoggerCallback._trial_configs dict grows without cleanup 3. TuneController._queued_trial_decisions dict grows when trials stop outside the normal result path Run with: python python/ray/tune/tests/repro_memory_leak.py Signed-off-by: goingforstudying-ctrl <goingforstudying-ctrl@users.noreply.github.com>
ca6c655 to
af8c1e3
Compare

Was running a large hyperparam sweep with tune.run() and noticed the driver process RSS just kept climbing. After a few thousand trials it was eating multiple GBs and eventually got killed by the OOM killer.
I dug into the per-trial accumulators and found three separate leaks layered on top of each other:
CSVLoggerCallback keeps a
_trial_continuedict to know whether to write a CSV header. The entry is never removed when the trial ends, so it grows linearly with the number of trials. Same issue in JsonLoggerCallback with_trial_configs.TuneController queues scheduler decisions in
_queued_trial_decisions, but if a trial stops or fails outside the normal result path (e.g. viastop_trial()or an exception), the queued entry never gets popped by_maybe_execute_queued_decision. It just sits there forever.Trial.get_json_state() serializes trial state with
indent=2for pretty printing. This is the largest per-trial blob — it's cached in memory and rewritten to the experiment-state checkpoint file every period. Dropping the indentation saves about 25-30% on both the retained footprint and the on-disk experiment state. Since this JSON is machine-read anyway, the pretty formatting is just waste.The first two are straightforward — just clean up the dict entries in
log_trial_end()and_schedule_trial_stop(). The third is a one-line change to removeindent=2fromjson.dumps.I also added a test file (
test_logger_memory_leak.py) that spins up a bunch of mock trials, runs them through the loggers, and asserts the internal dicts are empty afterwards. Caught all three issues during development.Not 100% sure if removing
indent=2breaks any downstream tools that parse the experiment state checkpoint as plain text. From what I can tell it's always loaded back viajson.loads, so it should be fine. Let me know if there's a tool I'm missing.This is the same thing reported in #64231.