Skip to content

process_validation_metrics() crashes on None-filled sparse reward_extra_info keys (metrics emitted for only some samples) #6830

Description

@giladfrid009

System Info

Summary

When a reward_extra_info key is emitted for only some validation samples (a sparse key), main_ppo_sync.PPOTrainer._validate null-fills it to a uniform schema, but process_validation_metrics then crashes computing np.mean([..., None, ...]) over the per-uid group. Any reward function that conditionally adds an extra-info key makes validation crash as soon as one sample omits it.

Environment

  • verl 0.8.0 (I confirmed this is an issue in the latest repo version as well, except some files moved around)
  • Trainer: verl.trainer.main_ppo_sync.PPOTrainer (sync / TransferQueue path)
  • trainer.validation_data_dir set (validation generation dump enabled)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Root cause

The two halves are individually reasonable but incompatible:

  1. Null-fill on dump. _validate pads sparse keys with None so every sample shares one schema:

    # main_ppo_sync.py (~L939-945)
    for key in reward_extra_infos_dict:
        if key != "reward" and key not in reward_extra_info:
            reward_extra_infos_dict[key].append(None)
    for key, value in reward_extra_info.items():
        if key not in reward_extra_infos_dict:
            reward_extra_infos_dict[key] = [None] * n_prior
        reward_extra_infos_dict[key].append(value)
  2. Aggregation can't tolerate that None. _val_metrics_update -> process_validation_metrics groups values per uid and means each group, but its skip-filter only catches empty lists and strings:

    # metric_utils.py (~L638-646)
    for var_name, var_vals in var2vals.items():
        if not var_vals or isinstance(var_vals[0], str):   # does NOT skip [None]
            continue
        n_resps = len(var_vals)
        metric = {f"mean@{n_resps}": float(np_mean(var_vals))}   # np.mean([None]) -> TypeError

A uid whose only rollout omitted the key has var_vals == [None], so:

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

Reproduction

A reward function that emits an extra-info key for only some samples:

reward_extra_info = {"reward": r, "parse_success": ok}
if ok:
    reward_extra_info["cp"] = engine_eval   # absent when ok is False

Run validation with at least one sample where ok is False. The dump ({step}.jsonl) is written fine (the missing cells show null), then _val_metrics_update raises.

Minimal trigger for the aggregation bug itself:

import numpy as np
np.mean([None])          # TypeError
np.mean([35.0, None])    # TypeError

Why current workarounds are unsatisfying

  • A numeric sentinel for the missing case skews the mean; nan propagates and turns the whole aggregate into nan (verl uses np.mean, not np.nanmean).
  • Forcing every sample to emit every key defeats the purpose of optional/diagnostic metrics, and the codebase clearly intends sparse keys, since _validate deliberately null-fills them for the dump.

Reproduction Code

from verl.trainer.ppo.metric_utils import process_validation_metrics

# Three validation prompts, one rollout each (n=1). Two parse-success samples that
# emitted a `cp` metric, and one parse-fail sample that did NOT emit `cp`.
data_sources = ["chess", "chess", "chess"]
sample_uids = ["uid_a", "uid_b", "uid_c"]

# This is exactly what `_validate` hands to `process_validation_metrics` AFTER its
# null-fill (main_ppo_sync.py ~L939-945): every key spans all samples, and the
# sample that never emitted `cp` is padded with None.
infos_dict = {
    "reward": [0.9, 0.1, 0.0],   # always present
    "cp":     [120, -30, None],  # sparse: None for the parse-fail sample (verl-inserted)
}

print("infos_dict =", infos_dict)
print("Calling process_validation_metrics ...\n")

result = process_validation_metrics(data_sources, sample_uids, infos_dict)

# If verl is fixed, we reach here. The `cp` mean should be computed over the
# samples that actually emitted it (120 and -30 -> mean 45), ignoring None.
print("No crash. Result:")
for ds, var2metric in result.items():
    for var, metrics in var2metric.items():
        print(f"  {ds}/{var}: {metrics}")

Expected behavior

Suggested Fix

Make process_validation_metrics treat None as "absent" instead of a value, e.g. drop None entries per group before the mean (and skip groups that become empty):

var_vals = [v for v in var_vals if v is not None]
if not var_vals or isinstance(var_vals[0], str):
    continue

This keeps the dump behavior (sparse keys still appear with null cells) while computing each key's mean only over the samples that actually emitted it. Alternatively, gate the null-fill so sparse keys are excluded from metric aggregation entirely.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghelp wantedExtra attention is needed

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions