`process_validation_metrics()` crashes on None-filled sparse `reward_extra_info` keys (metrics emitted for only some samples)

### System Info

### Summary

When a `reward_extra_info` key is emitted for only *some* validation samples (a sparse key), `main_ppo_sync.PPOTrainer._validate` null-fills it to a uniform schema, but `process_validation_metrics` then crashes computing `np.mean([..., None, ...])` over the per-uid group. Any reward function that conditionally adds an extra-info key makes validation crash as soon as one sample omits it.

### Environment

- verl 0.8.0 (I confirmed this is an issue in the latest repo version as well, except some files moved around)
- Trainer: `verl.trainer.main_ppo_sync.PPOTrainer` (sync / TransferQueue path)
- `trainer.validation_data_dir` set (validation generation dump enabled)

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction

### Root cause

The two halves are individually reasonable but incompatible:

1. **Null-fill on dump.** `_validate` pads sparse keys with `None` so every sample shares one schema:

   ```python
   # main_ppo_sync.py (~L939-945)
   for key in reward_extra_infos_dict:
       if key != "reward" and key not in reward_extra_info:
           reward_extra_infos_dict[key].append(None)
   for key, value in reward_extra_info.items():
       if key not in reward_extra_infos_dict:
           reward_extra_infos_dict[key] = [None] * n_prior
       reward_extra_infos_dict[key].append(value)
   ```

2. **Aggregation can't tolerate that `None`.** `_val_metrics_update` -> `process_validation_metrics` groups values per uid and means each group, but its skip-filter only catches empty lists and strings:

   ```python
   # metric_utils.py (~L638-646)
   for var_name, var_vals in var2vals.items():
       if not var_vals or isinstance(var_vals[0], str):   # does NOT skip [None]
           continue
       n_resps = len(var_vals)
       metric = {f"mean@{n_resps}": float(np_mean(var_vals))}   # np.mean([None]) -> TypeError
   ```

A uid whose only rollout omitted the key has `var_vals == [None]`, so:

```
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
```

### Reproduction

A reward function that emits an extra-info key for only some samples:

```python
reward_extra_info = {"reward": r, "parse_success": ok}
if ok:
    reward_extra_info["cp"] = engine_eval   # absent when ok is False
```

Run validation with at least one sample where `ok` is `False`. The dump (`{step}.jsonl`) is written fine (the missing cells show `null`), then `_val_metrics_update` raises.

Minimal trigger for the aggregation bug itself:

```python
import numpy as np
np.mean([None])          # TypeError
np.mean([35.0, None])    # TypeError
```

### Why current workarounds are unsatisfying

- A numeric sentinel for the missing case skews the mean; `nan` propagates and turns the whole aggregate into `nan` (verl uses `np.mean`, not `np.nanmean`).
- Forcing every sample to emit every key defeats the purpose of optional/diagnostic metrics, and the codebase clearly intends sparse keys, since `_validate` deliberately null-fills them for the dump.

## Reproduction Code

```python
from verl.trainer.ppo.metric_utils import process_validation_metrics

# Three validation prompts, one rollout each (n=1). Two parse-success samples that
# emitted a `cp` metric, and one parse-fail sample that did NOT emit `cp`.
data_sources = ["chess", "chess", "chess"]
sample_uids = ["uid_a", "uid_b", "uid_c"]

# This is exactly what `_validate` hands to `process_validation_metrics` AFTER its
# null-fill (main_ppo_sync.py ~L939-945): every key spans all samples, and the
# sample that never emitted `cp` is padded with None.
infos_dict = {
    "reward": [0.9, 0.1, 0.0],   # always present
    "cp":     [120, -30, None],  # sparse: None for the parse-fail sample (verl-inserted)
}

print("infos_dict =", infos_dict)
print("Calling process_validation_metrics ...\n")

result = process_validation_metrics(data_sources, sample_uids, infos_dict)

# If verl is fixed, we reach here. The `cp` mean should be computed over the
# samples that actually emitted it (120 and -30 -> mean 45), ignoring None.
print("No crash. Result:")
for ds, var2metric in result.items():
    for var, metrics in var2metric.items():
        print(f"  {ds}/{var}: {metrics}")

```

### Expected behavior

### Suggested Fix

Make `process_validation_metrics` treat `None` as "absent" instead of a value, e.g. drop `None` entries per group before the mean (and skip groups that become empty):

```python
var_vals = [v for v in var_vals if v is not None]
if not var_vals or isinstance(var_vals[0], str):
    continue
```

This keeps the dump behavior (sparse keys still appear with `null` cells) while computing each key's mean only over the samples that actually emitted it. Alternatively, gate the null-fill so sparse keys are excluded from metric aggregation entirely.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`process_validation_metrics()` crashes on None-filled sparse `reward_extra_info` keys (metrics emitted for only some samples) #6830

System Info

Summary

Environment

Information

Tasks

Reproduction

Root cause

Reproduction

Why current workarounds are unsatisfying

Reproduction Code

Expected behavior

Suggested Fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

process_validation_metrics() crashes on None-filled sparse reward_extra_info keys (metrics emitted for only some samples) #6830

Description

System Info

Summary

Environment

Information

Tasks

Reproduction

Root cause

Reproduction

Why current workarounds are unsatisfying

Reproduction Code

Expected behavior

Suggested Fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`process_validation_metrics()` crashes on None-filled sparse `reward_extra_info` keys (metrics emitted for only some samples) #6830