System Info
Summary
When a reward_extra_info key is emitted for only some validation samples (a sparse key), main_ppo_sync.PPOTrainer._validate null-fills it to a uniform schema, but process_validation_metrics then crashes computing np.mean([..., None, ...]) over the per-uid group. Any reward function that conditionally adds an extra-info key makes validation crash as soon as one sample omits it.
Environment
- verl 0.8.0 (I confirmed this is an issue in the latest repo version as well, except some files moved around)
- Trainer:
verl.trainer.main_ppo_sync.PPOTrainer (sync / TransferQueue path)
trainer.validation_data_dir set (validation generation dump enabled)
Information
Tasks
Reproduction
Root cause
The two halves are individually reasonable but incompatible:
-
Null-fill on dump. _validate pads sparse keys with None so every sample shares one schema:
# main_ppo_sync.py (~L939-945)
for key in reward_extra_infos_dict:
if key != "reward" and key not in reward_extra_info:
reward_extra_infos_dict[key].append(None)
for key, value in reward_extra_info.items():
if key not in reward_extra_infos_dict:
reward_extra_infos_dict[key] = [None] * n_prior
reward_extra_infos_dict[key].append(value)
-
Aggregation can't tolerate that None. _val_metrics_update -> process_validation_metrics groups values per uid and means each group, but its skip-filter only catches empty lists and strings:
# metric_utils.py (~L638-646)
for var_name, var_vals in var2vals.items():
if not var_vals or isinstance(var_vals[0], str): # does NOT skip [None]
continue
n_resps = len(var_vals)
metric = {f"mean@{n_resps}": float(np_mean(var_vals))} # np.mean([None]) -> TypeError
A uid whose only rollout omitted the key has var_vals == [None], so:
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
Reproduction
A reward function that emits an extra-info key for only some samples:
reward_extra_info = {"reward": r, "parse_success": ok}
if ok:
reward_extra_info["cp"] = engine_eval # absent when ok is False
Run validation with at least one sample where ok is False. The dump ({step}.jsonl) is written fine (the missing cells show null), then _val_metrics_update raises.
Minimal trigger for the aggregation bug itself:
import numpy as np
np.mean([None]) # TypeError
np.mean([35.0, None]) # TypeError
Why current workarounds are unsatisfying
- A numeric sentinel for the missing case skews the mean;
nan propagates and turns the whole aggregate into nan (verl uses np.mean, not np.nanmean).
- Forcing every sample to emit every key defeats the purpose of optional/diagnostic metrics, and the codebase clearly intends sparse keys, since
_validate deliberately null-fills them for the dump.
Reproduction Code
from verl.trainer.ppo.metric_utils import process_validation_metrics
# Three validation prompts, one rollout each (n=1). Two parse-success samples that
# emitted a `cp` metric, and one parse-fail sample that did NOT emit `cp`.
data_sources = ["chess", "chess", "chess"]
sample_uids = ["uid_a", "uid_b", "uid_c"]
# This is exactly what `_validate` hands to `process_validation_metrics` AFTER its
# null-fill (main_ppo_sync.py ~L939-945): every key spans all samples, and the
# sample that never emitted `cp` is padded with None.
infos_dict = {
"reward": [0.9, 0.1, 0.0], # always present
"cp": [120, -30, None], # sparse: None for the parse-fail sample (verl-inserted)
}
print("infos_dict =", infos_dict)
print("Calling process_validation_metrics ...\n")
result = process_validation_metrics(data_sources, sample_uids, infos_dict)
# If verl is fixed, we reach here. The `cp` mean should be computed over the
# samples that actually emitted it (120 and -30 -> mean 45), ignoring None.
print("No crash. Result:")
for ds, var2metric in result.items():
for var, metrics in var2metric.items():
print(f" {ds}/{var}: {metrics}")
Expected behavior
Suggested Fix
Make process_validation_metrics treat None as "absent" instead of a value, e.g. drop None entries per group before the mean (and skip groups that become empty):
var_vals = [v for v in var_vals if v is not None]
if not var_vals or isinstance(var_vals[0], str):
continue
This keeps the dump behavior (sparse keys still appear with null cells) while computing each key's mean only over the samples that actually emitted it. Alternatively, gate the null-fill so sparse keys are excluded from metric aggregation entirely.
System Info
Summary
When a
reward_extra_infokey is emitted for only some validation samples (a sparse key),main_ppo_sync.PPOTrainer._validatenull-fills it to a uniform schema, butprocess_validation_metricsthen crashes computingnp.mean([..., None, ...])over the per-uid group. Any reward function that conditionally adds an extra-info key makes validation crash as soon as one sample omits it.Environment
verl.trainer.main_ppo_sync.PPOTrainer(sync / TransferQueue path)trainer.validation_data_dirset (validation generation dump enabled)Information
Tasks
examplesfolder (such as GLUE/SQuAD, ...)Reproduction
Root cause
The two halves are individually reasonable but incompatible:
Null-fill on dump.
_validatepads sparse keys withNoneso every sample shares one schema:Aggregation can't tolerate that
None._val_metrics_update->process_validation_metricsgroups values per uid and means each group, but its skip-filter only catches empty lists and strings:A uid whose only rollout omitted the key has
var_vals == [None], so:Reproduction
A reward function that emits an extra-info key for only some samples:
Run validation with at least one sample where
okisFalse. The dump ({step}.jsonl) is written fine (the missing cells shownull), then_val_metrics_updateraises.Minimal trigger for the aggregation bug itself:
Why current workarounds are unsatisfying
nanpropagates and turns the whole aggregate intonan(verl usesnp.mean, notnp.nanmean)._validatedeliberately null-fills them for the dump.Reproduction Code
Expected behavior
Suggested Fix
Make
process_validation_metricstreatNoneas "absent" instead of a value, e.g. dropNoneentries per group before the mean (and skip groups that become empty):This keeps the dump behavior (sparse keys still appear with
nullcells) while computing each key's mean only over the samples that actually emitted it. Alternatively, gate the null-fill so sparse keys are excluded from metric aggregation entirely.