feat(code_gen): add per-language subset metrics for livecodebench-x #1719

Open
rodboev wants to merge 3 commits into NVIDIA-NeMo:main from rodboev:pr/code-gen-livecodebench-x-per-lang

@rodboev rodboev commented Jun 25, 2026

compute_metrics() currently groups by difficulty only. livecodebench-x benchmark rows carry a target_language field (de, es, fr, ja) at the top level of the data row (adjacent to verifier_metadata, not inside it). Because the field was never declared on CompCodingVerifyRequest, Pydantic silently dropped it, so it never reached CompCodingVerifyResponse or compute_subset_metrics. As a result, per-language pass@k values never appear in rollouts_aggregate_metrics.json.
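
The silent drop comes from Pydantic's default handling of undeclared input keys, which is to ignore them. A minimal sketch of that behavior, assuming Pydantic v2; `RequestBefore`, `RequestAfter`, and the row contents are hypothetical stand-ins, not the actual `CompCodingVerifyRequest` definition:

```python
from typing import Optional

from pydantic import BaseModel


class RequestBefore(BaseModel):
    # Hypothetical stand-in for the request model before the change:
    # target_language is not declared.
    verifier_metadata: dict


class RequestAfter(BaseModel):
    # Same stand-in with the field declared as an optional top-level field.
    verifier_metadata: dict
    target_language: Optional[str] = None


# Illustrative row: target_language sits next to verifier_metadata, not inside it.
row = {"verifier_metadata": {"example_key": "example_value"}, "target_language": "de"}

before = RequestBefore.model_validate(row)
after = RequestAfter.model_validate(row)

# The undeclared key is ignored without an error, so it is absent from the dump.
print("target_language" in before.model_dump())
# Once declared, the value survives validation and serialization.
print(after.model_dump()["target_language"])
```

Because the field defaults to `None`, rows from benchmarks that have no `target_language` still validate against the second model.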

This PR fixes that by mirroring the existing difficulty pattern:

  • Adds target_language: Optional[str] = None to CompCodingVerifyRequest so Pydantic preserves the top-level field through routing
  • Adds target_language: Optional[str] = None to CompCodingVerifyResponse
  • Reads body.target_language in verify() and propagates it through all three return sites via **body.model_dump()
  • Calls compute_subset_metrics(tasks, "target_language", ...) alongside the existing difficulty call
  • Confirms get_key_metrics() already handles the new prefixes generically, so no change is needed there
  • Adds focused tests for the new metric keys and field propagation
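
The idea of grouping by `target_language` in the same way as `difficulty` can be sketched roughly as follows. This is an illustration only: `subset_mean_reward`, the flat rollout dicts, the `reward` key, and the `field/value` key format are all assumptions, not the real `compute_subset_metrics` signature or the actual metric names.

```python
from collections import defaultdict
from typing import Optional


def subset_mean_reward(rollouts: list[dict], field: str) -> dict[str, float]:
    """Group rollouts by `field` and return the mean reward per group.

    Hypothetical helper for illustration; not the real compute_subset_metrics.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for rollout in rollouts:
        value: Optional[str] = rollout.get(field)
        if value is None:
            # Rows without the field (e.g. other benchmarks) form no subset.
            continue
        groups[value].append(float(rollout["reward"]))
    return {
        f"{field}/{value}": sum(rewards) / len(rewards)
        for value, rewards in sorted(groups.items())
    }


# Made-up rollouts; the last one has no target_language.
rollouts = [
    {"reward": 1.0, "difficulty": "easy", "target_language": "de"},
    {"reward": 0.0, "difficulty": "hard", "target_language": "de"},
    {"reward": 1.0, "difficulty": "easy", "target_language": "ja"},
    {"reward": 1.0, "difficulty": "easy"},
]

# The same helper serves both groupings, mirroring the existing difficulty call.
by_difficulty = subset_mean_reward(rollouts, "difficulty")
by_language = subset_mean_reward(rollouts, "target_language")
print(by_difficulty)
print(by_language)
```

Skipping rows where the field is `None` means benchmarks without a `target_language` produce no extra keys.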

Closes #1171

Checklist

  • Targeted tests pass: uv run pytest resources_servers/code_gen/tests/test_app.py -x -v
  • No changes to benchmark data files, config YAML, or nemo_gym/reward_profile.py
  • DCO sign-off on all commits

rodboev added 3 commits June 24, 2026 20:49
Signed-off-by: Rod Boev <rod.boev@gmail.com>
…correctly

Signed-off-by: Rod Boev <rod.boev@gmail.com>
…_dump

Signed-off-by: Rod Boev <rod.boev@gmail.com>
copy-pr-bot Bot commented Jun 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nemo-automation-bot nemo-automation-bot Bot added the community-request Issue reported or requested by someone from the community label Jun 25, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 27, 2026

Development

Successfully merging this pull request may close these issues.

Per-language metric breakdown is not produced for livecodebench-x by code_gen
