
Add Korean benchmarks: KoBALT and CLIcK#1452

Open
bzantium wants to merge 5 commits into
NVIDIA-NeMo:mainfrom
bzantium:feat/kobalt-click


@bzantium

@bzantium bzantium commented May 16, 2026

Contributor

Summary

Adds two more Korean evaluation benchmarks to the multilingual suite, following the same pattern as the six Korean benchmarks added in #1375.

  • KoBALT-700 (HF, paper): 700 expert-written 10-choice MCQ items targeting deep Korean linguistic competence across five domains (Syntax 300, Semantics 215, Pragmatics 81, Phonetics/Phonology 62, Morphology 42). The dataset ships each question as one string with the ten options embedded as `A: ... J: ...` lines; the prepare script splits these into the standard MCQ fields so the existing `eval_type=multichoice` evaluator can be reused. A new 10-choice prompt config (`eval/korean/mcq-10choices.yaml`) matches the `정답: A/B/.../J` answer format.

  • CLIcK (HF, paper, LREC-COLING 2024): 1995 items across 11 paper-defined subcategories under two broad domains (8 Culture: Economy / Geography / History / Law / Politics / Popular / Society / Tradition; 3 Language: Functional / Grammar / Textual). The HF mirror is a single flat split with no subcategory metadata, so the prepare script pulls the per-subcategory JSON files directly from the canonical GitHub repo and tags each item. Mixes 4- and 5-choice MCQ; the existing 5-choice prompt covers both.

Both benchmarks are single-language Korean and reuse the existing `multichoice` evaluator and per-sample `extract_regex` matching `정답: `.
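As a rough illustration of the per-sample `extract_regex` idea, the sketch below pulls the final answer letter out of a model response. The pattern and the helper name `extract_answer` are assumptions made for this example; the regex actually set by the prepare scripts may differ.

```python
import re
from typing import Optional

# Hypothetical pattern: matches "정답: X" where X is one of the ten KoBALT letters.
# For CLIcK the character class would only need A-E.
ANSWER_RE = re.compile(r"정답:\s*([A-J])")


def extract_answer(response: str) -> Optional[str]:
    """Return the letter from the last '정답: X' in the response, or None."""
    matches = ANSWER_RE.findall(response)
    # Take the last match so earlier, discarded guesses in the reasoning are ignored.
    return matches[-1] if matches else None


print(extract_answer("풀이 과정 ...\n정답: C"))
```

Taking the last match rather than the first is a design choice for this sketch: reasoning models often mention candidate answers before committing to one.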

Reference numbers

Validated end to end on three production models (`ns eval` on a single 4-GPU node, judge-free MCQ). Numbers also added to `docs/evaluation/multilingual.md`.

KoBALT (symbolic_correct, %)

| Model | Avg | Syntax | Semantics | Pragmatics | Phonetics/Phonology | Morphology |
| --- | --- | --- | --- | --- | --- | --- |
| Nemotron-Cascade-2-30B-A3B | 32.29 | 32.00 | 46.05 | 30.86 | 4.84 | 7.14 |
| kanana-2-30b-a3b-thinking-2601 | 36.14 | 33.00 | 49.30 | 28.40 | 19.35 | 30.95 |
| gemma-4-26B-A4B-it | 69.57 | 75.33 | 71.16 | 60.49 | 54.84 | 59.52 |
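
The Avg column is consistent with an item-weighted mean over the five domains, using the domain sizes from the Summary (300 / 215 / 81 / 62 / 42). The sketch below recomputes it from the per-domain numbers; treating Avg as an item-weighted mean is an assumption of this check, not something stated in the PR, and `weighted_avg` is a name made up for the example.

```python
# Domain sizes as listed in the PR summary; they sum to 700.
DOMAIN_SIZES = {
    "Syntax": 300,
    "Semantics": 215,
    "Pragmatics": 81,
    "Phonetics/Phonology": 62,
    "Morphology": 42,
}


def weighted_avg(per_domain: dict) -> float:
    """Item-weighted mean of per-domain accuracies, rounded to two decimals."""
    total = sum(DOMAIN_SIZES.values())
    weighted = sum(DOMAIN_SIZES[name] * acc for name, acc in per_domain.items())
    return round(weighted / total, 2)


nemotron = {
    "Syntax": 32.00,
    "Semantics": 46.05,
    "Pragmatics": 30.86,
    "Phonetics/Phonology": 4.84,
    "Morphology": 7.14,
}
print(weighted_avg(nemotron))  # -> 32.29
```

The same computation gives 36.14 and 69.57 for the other two rows, matching the table.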

CLIcK (symbolic_correct, %)

| Model | Overall |
| --- | --- |
| Nemotron-Cascade-2-30B-A3B | 65.81 |
| kanana-2-30b-a3b-thinking-2601 | 73.23 |
| gemma-4-26B-A4B-it | 86.77 |

Summary by CodeRabbit

  • New Features

    • Added two Korean MCQ evaluation benchmarks: kobalt (10-choice) and click (5-choice), including prepared test data and evaluation configs.
  • Documentation

    • Multilingual evaluation docs updated: Korean MCQ list expanded to six; clarified per-option answer extraction behavior, kobalt’s 10-choice prompt usage, and that these benchmarks are single-language (no --languages flag).


bzantium added 3 commits May 16, 2026 14:51
KoBALT-700 (snunlp/KoBALT-700, arXiv:2505.16125): expert-written linguistic
competence benchmark with 700 ten-option MCQs across five domains
(Syntax, Semantics, Pragmatics, Phonetics/Phonology, Morphology). The
prepare script splits the embedded option lines (A through J) out of the
single 'Question' field so the standard MCQ template can render them; a
new 10-choice prompt config matches the per-sample extract_regex.

CLIcK (EunsuKim/CLIcK, LREC-COLING 2024): Korean cultural and linguistic
intelligence benchmark with 1995 items across 11 subcategories under two
broad domains (Culture, Language). The HF mirror has no subcategory
metadata, so the prepare script pulls the per-subcategory JSON files from
the canonical GitHub repo (rladmstn1714/CLIcK) and tags each item with
its paper-defined subcategory. Mixes 4- and 5-option MCQ, so the
existing 5-choice prompt is reused; text answers are mapped back to the
corresponding letter.

Signed-off-by: bzantium <ryumin93@gmail.com>
…Korean prepare scripts

Signed-off-by: bzantium <ryumin93@gmail.com>
Adds per-benchmark sections with reference numbers (Nemotron-Cascade-2,
kanana-2 thinking, gemma-4-26B) and example commands. Updates the
closing paragraph to reflect six Korean MCQ benchmarks total.

Signed-off-by: bzantium <ryumin93@gmail.com>
@coderabbitai

coderabbitai Bot commented May 16, 2026

Contributor
📝 Walkthrough

Walkthrough

Adds KoBALT (10-choice) and CLIcK (5-choice) Korean MCQ benchmarks: dataset prepare scripts that emit JSONL, per-dataset evaluation config constants, a 10-choice Korean MCQ prompt template, and documentation updates with reference results and eval command examples.

Changes

Korean MCQ Benchmarks: KoBALT and CLIcK

| Layer / File(s) | Summary |
| --- | --- |
| KoBALT dataset preparation and configuration: `nemo_skills/dataset/kobalt/__init__.py`, `nemo_skills/dataset/kobalt/prepare.py` | KoBALT loads snunlp/KoBALT-700, parses combined questions into stem and 10 options (A–J) via regex, maps expected answers, sets Korean extract_regex and fixed flags, writes JSONL, and exports METRICS_TYPE plus a 10-choice GENERATION_ARGS prompt override. |
| CLIcK dataset preparation and configuration: `nemo_skills/dataset/click/__init__.py`, `nemo_skills/dataset/click/prepare.py` | CLIcK loads bzantium/CLIcK (test), maps correct-answer text to A–E, optionally prepends paragraph to question, sets extract_regex and subset_for_metrics, writes JSONL, and exports METRICS_TYPE with default 5-choice GENERATION_ARGS. |
| 10-choice prompt template and documentation: `nemo_skills/prompt/config/eval/korean/mcq-10choices.yaml`, `docs/evaluation/multilingual.md` | Adds a Korean 10-choice MCQ prompt template requiring final 정답: A/.../J line. Updates evaluation docs with kobalt and click benchmark sections (sources, counts, reference tables, ns eval examples) and expands the Korean MCQ summary to six benchmarks. |
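
The CLIcK row above mentions mapping the correct-answer text back to a letter for items that have either four or five options. A minimal sketch of that step follows; the function name, argument names, and error messages are hypothetical and not taken from `prepare.py`.

```python
import string


def answer_text_to_letter(choices: list, answer_text: str) -> str:
    """Map the correct-answer text to its option letter (A to E)."""
    stripped = [choice.strip() for choice in choices]
    # CLIcK mixes 4- and 5-option items, so both lengths are accepted here.
    if not 4 <= len(stripped) <= 5:
        raise ValueError(f"Expected 4 or 5 choices, got {len(stripped)}.")
    try:
        index = stripped.index(answer_text.strip())
    except ValueError:
        raise ValueError(f"Answer {answer_text!r} is not among the choices.") from None
    return string.ascii_uppercase[index]


print(answer_text_to_letter(["서울", "부산", "대구", "인천"], "대구"))  # -> C
```

Failing loudly when the answer text is missing from the choices keeps a mislabeled item from being written out with a wrong letter.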

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run GPU tests

Suggested reviewers

  • naymaraq
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: adding two new Korean benchmarks (KoBALT and CLIcK) to the evaluation suite. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
nemo_skills/dataset/click/prepare.py (1)

34-35: ⚡ Quick win

Pin CLIcK source to a commit/tag for reproducibility.

Using main in raw GitHub URLs makes prepared data mutable over time. Pin to a commit SHA (or immutable tag) so benchmark rows and reference numbers stay stable.

💡 Proposed fix
-_GITHUB_BASE = "https://raw.githubusercontent.com/rladmstn1714/CLIcK/main/Dataset"
+_GITHUB_BASE = "https://raw.githubusercontent.com/rladmstn1714/CLIcK/<pinned-commit-sha>/Dataset"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/click/prepare.py` around lines 34 - 35, The URL constant
_GITHUB_BASE currently points to the raw GitHub path using "main", which makes
downloaded dataset files mutable; update _GITHUB_BASE to reference a specific
immutable commit SHA or tag (e.g., replace "main" with the commit hash) so
prepared data is reproducible, and optionally allow overriding via an
environment variable or new constant to keep flexibility while defaulting to the
pinned SHA; locate and update the _GITHUB_BASE definition in prepare.py (and any
callers that build URLs from _SUBDIR) so all subsequent raw file requests use
the pinned reference.
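
One way the pinned reference with an optional override could look is sketched below. The environment variable name `CLICK_GIT_REF` and the helper `github_base` are made up for this illustration, and the `<pinned-commit-sha>` placeholder is kept as-is because the review does not name a specific commit.

```python
import os
from typing import Optional

# Placeholder from the proposed fix: replace with a real commit SHA or immutable tag.
_DEFAULT_REF = "<pinned-commit-sha>"


def github_base(ref: Optional[str] = None) -> str:
    """Build the raw GitHub base URL for the CLIcK dataset at a fixed ref."""
    # Precedence: explicit argument, then the (hypothetical) environment
    # variable, then the pinned default.
    resolved = ref or os.environ.get("CLICK_GIT_REF", _DEFAULT_REF)
    return f"https://raw.githubusercontent.com/rladmstn1714/CLIcK/{resolved}/Dataset"


print(github_base("abc123"))
```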
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/click/prepare.py`:
- Around line 103-107: The write_data_to_file function currently formats entries
while streaming them to disk, which can leave a partially written file if
formatting fails; change it to first precompute a list (or generator fully
realized) of formatted entries by calling format_entry for every entry (e.g.,
build formatted = [format_entry(e) for e in data] using tqdm for progress if
desired), and only after that opens output_file for writing and dumps each
preformatted item as JSONL; keep the same json.dump/fout.write("\n") logic but
ensure all formatting completes before the file is opened.

In `@nemo_skills/dataset/kobalt/prepare.py`:
- Around line 56-60: The current write_data_to_file interleaves formatting
(format_entry) and file writes so a formatting error can leave a truncated
JSONL; change write_data_to_file to first iterate over data and produce all
formatted JSON strings (or Python objects) into an in-memory list using
format_entry/json.dumps, and only after that successfully completes open
output_file and write the prepared lines with fout.write(... + "\n") (keep tqdm
desc using output_file.name); reference the write_data_to_file function and
format_entry when making this change.
- Around line 36-41: In _split_question, add explicit validation: check the
re.search result stored in first_option and if it's None raise a ValueError
stating the item is malformed/missing the "A:" marker and include the original
question_text (or a short excerpt) for debugging; after building options using
_OPTION_SPLIT and stripping with _OPTION_PREFIX, verify len(options) == 10 and
if not raise a ValueError that includes the parsed options count and the options
list (or excerpt) so the caller gets a clear error instead of silent malformed
MCQ rows.

---

Nitpick comments:
In `@nemo_skills/dataset/click/prepare.py`:
- Around line 34-35: The URL constant _GITHUB_BASE currently points to the raw
GitHub path using "main", which makes downloaded dataset files mutable; update
_GITHUB_BASE to reference a specific immutable commit SHA or tag (e.g., replace
"main" with the commit hash) so prepared data is reproducible, and optionally
allow overriding via an environment variable or new constant to keep flexibility
while defaulting to the pinned SHA; locate and update the _GITHUB_BASE
definition in prepare.py (and any callers that build URLs from _SUBDIR) so all
subsequent raw file requests use the pinned reference.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b93c0a8-b51e-4a92-b473-7db5d3354a12

📥 Commits

Reviewing files that changed from the base of the PR and between 98a3cf3 and 1ea7c11.

📒 Files selected for processing (6)
  • docs/evaluation/multilingual.md
  • nemo_skills/dataset/click/__init__.py
  • nemo_skills/dataset/click/prepare.py
  • nemo_skills/dataset/kobalt/__init__.py
  • nemo_skills/dataset/kobalt/prepare.py
  • nemo_skills/prompt/config/eval/korean/mcq-10choices.yaml

Comment on lines +103 to +107
def write_data_to_file(output_file, data):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
            json.dump(format_entry(entry), fout, ensure_ascii=False)
            fout.write("\n")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Precompute formatted entries before writing output JSONL.

Current flow can leave partially written files if any entry fails during formatting. Format first, write second.

As per coding guidelines, When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.

💡 Proposed fix
 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/click/prepare.py` around lines 103 - 107, The
write_data_to_file function currently formats entries while streaming them to
disk, which can leave a partially written file if formatting fails; change it to
first precompute a list (or generator fully realized) of formatted entries by
calling format_entry for every entry (e.g., build formatted = [format_entry(e)
for e in data] using tqdm for progress if desired), and only after that opens
output_file for writing and dumps each preformatted item as JSONL; keep the same
json.dump/fout.write("\n") logic but ensure all formatting completes before the
file is opened.
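
For readers who want a stronger guarantee than the proposed fix, the sketch below combines format-first with a write to a temporary file that is renamed over the target only on success. The atomic rename goes beyond what the review asks for, and `write_jsonl_atomically` with its `format_entry` parameter is a hypothetical helper, not code from this PR.

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomically(output_file, data, format_entry) -> None:
    """Format all entries first, then write JSONL via a temp file and rename."""
    # Any formatting error is raised here, before any file is touched.
    rows = [format_entry(entry) for entry in data]

    output_file = Path(output_file)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wt", encoding="utf-8") as fout:
            for row in rows:
                json.dump(row, fout, ensure_ascii=False)
                fout.write("\n")
        # Swap the finished file into place in a single step.
        os.replace(tmp_name, output_file)
    except BaseException:
        # Clean up the temp file and leave any existing output untouched.
        os.unlink(tmp_name)
        raise
```

With this shape, an existing `test.jsonl` is either fully replaced or left exactly as it was.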

Comment on lines +36 to +41
def _split_question(question_text: str) -> tuple[str, list[str]]:
    first_option = re.search(r"\nA[:.\)]\s", question_text)
    stem = question_text[: first_option.start()].rstrip("\n")
    options_block = question_text[first_option.start() + 1 :]
    options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
    return stem, options


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate KoBALT option parsing explicitly.

If an item is malformed (missing A: marker or parsed options count not equal to 10), this currently fails with an opaque error or emits malformed MCQ rows. Add explicit validation and fail with a clear ValueError.

💡 Proposed fix
 def _split_question(question_text: str) -> tuple[str, list[str]]:
     first_option = re.search(r"\nA[:.\)]\s", question_text)
+    if first_option is None:
+        raise ValueError("Could not find the first option marker `A:` in question text.")
+
     stem = question_text[: first_option.start()].rstrip("\n")
     options_block = question_text[first_option.start() + 1 :]
     options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
+    if len(options) != 10:
+        raise ValueError(f"Expected 10 options (A-J), got {len(options)}.")
     return stem, options
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/kobalt/prepare.py` around lines 36 - 41, In
_split_question, add explicit validation: check the re.search result stored in
first_option and if it's None raise a ValueError stating the item is
malformed/missing the "A:" marker and include the original question_text (or a
short excerpt) for debugging; after building options using _OPTION_SPLIT and
stripping with _OPTION_PREFIX, verify len(options) == 10 and if not raise a
ValueError that includes the parsed options count and the options list (or
excerpt) so the caller gets a clear error instead of silent malformed MCQ rows.
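
A self-contained version of the validated parser is sketched below so the behavior can be tried in isolation. The first-option pattern is copied from the quoted code; the definitions of `_OPTION_SPLIT` and `_OPTION_PREFIX` are not shown in this thread, so the two patterns used here are guesses that only aim to reproduce the described `A: ... J: ...` layout.

```python
import re

_FIRST_OPTION = re.compile(r"\nA[:.\)]\s")  # pattern taken from the quoted code
# Assumed definitions: the real _OPTION_SPLIT / _OPTION_PREFIX may differ.
_OPTION_SPLIT = re.compile(r"\n(?=[A-J][:.\)]\s)")
_OPTION_PREFIX = re.compile(r"^[A-J][:.\)]\s*")


def split_question(question_text: str):
    """Split a combined KoBALT question string into (stem, ten options)."""
    first_option = _FIRST_OPTION.search(question_text)
    if first_option is None:
        raise ValueError(f"No 'A:' option marker found in: {question_text[:80]!r}")

    stem = question_text[: first_option.start()].rstrip("\n")
    # Skip the newline that precedes the "A:" marker.
    options_block = question_text[first_option.start() + 1 :]
    options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
    if len(options) != 10:
        raise ValueError(f"Expected 10 options (A-J), got {len(options)}: {options!r}")
    return stem, options


sample = "다음 중 옳은 것은?\n" + "\n".join(f"{letter}: 보기 {letter}" for letter in "ABCDEFGHIJ")
stem, options = split_question(sample)
print(stem, len(options))
```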

Comment on lines +56 to +60
def write_data_to_file(output_file, data):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
            json.dump(format_entry(entry), fout, ensure_ascii=False)
            fout.write("\n")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid partial JSONL writes on formatting failures.

Formatting and writing are interleaved, so a single bad sample can leave a truncated output file. Compute all formatted rows first, then open and write the file.

As per coding guidelines, When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.

💡 Proposed fix
 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/kobalt/prepare.py` around lines 56 - 60, The current
write_data_to_file interleaves formatting (format_entry) and file writes so a
formatting error can leave a truncated JSONL; change write_data_to_file to first
iterate over data and produce all formatted JSON strings (or Python objects)
into an in-memory list using format_entry/json.dumps, and only after that
successfully completes open output_file and write the prepared lines with
fout.write(... + "\n") (keep tqdm desc using output_file.name); reference the
write_data_to_file function and format_entry when making this change.

@bzantium

Contributor Author

For context, here is how widely these two benchmarks are used in the field. Both show up across the major Korean LLM technical reports and recent cross-lingual evaluation frameworks, so adding them to the Skills suite extends coverage to evals that maintainers of widely deployed Korean LLMs already report.

CLIcK (arXiv:2403.06412, LREC-COLING 2024)

50+ citing papers on Semantic Scholar. Selected:

Korean LLM technical reports

Evaluation / benchmark frameworks

KoBALT (arXiv:2505.16125)

Newer (May 2025) but already picked up by:

Both fit the same role as the six Korean benchmarks added in #1375: standard Korean evals that the major Korean LLM teams (HyperCLOVA, Solar, Mi:dm, EXAONE, K-EXAONE) already report.

The upstream EunsuKim/CLIcK HF mirror drops the per-item subcategory
labels (only the GitHub repo's directory layout carries them), so the
previous prepare script had to pull each per-subcategory JSON file
directly over HTTP. bzantium/CLIcK is a clean HF mirror that restores
those labels (subcategory + broad_category fields), letting prepare.py
just call load_dataset() like every other Korean benchmark on this
branch.

Signed-off-by: bzantium <ryumin93@gmail.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
nemo_skills/dataset/click/prepare.py (1)

54-58: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Precompute formatted entries before writing output JSONL.

Current flow can leave partially written files if any entry fails during formatting. Format first, write second. This issue was flagged in a previous review and remains unaddressed.

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

💡 Proposed fix
 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/click/prepare.py` around lines 54 - 58, The function
write_data_to_file currently formats each entry while the output file is open,
risking partially-written files if format_entry raises; change it to first fully
precompute/validate all formatted entries (call format_entry for each entry and
collect results, allowing exceptions to surface before any file I/O), then open
output_file and write the preformatted JSONL lines; reference the
write_data_to_file function and the format_entry call so you precompute entries
into a list (or other container) and only after successful formatting perform
the open/write loop to avoid partial files.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@nemo_skills/dataset/click/prepare.py`:
- Around line 54-58: The function write_data_to_file currently formats each
entry while the output file is open, risking partially-written files if
format_entry raises; change it to first fully precompute/validate all formatted
entries (call format_entry for each entry and collect results, allowing
exceptions to surface before any file I/O), then open output_file and write the
preformatted JSONL lines; reference the write_data_to_file function and the
format_entry call so you precompute entries into a list (or other container) and
only after successful formatting perform the open/write loop to avoid partial
files.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 894d0819-88a4-4e57-99f4-93c5a2adf141

📥 Commits

Reviewing files that changed from the base of the PR and between 1ea7c11 and 1f23c20.

📒 Files selected for processing (2)
  • docs/evaluation/multilingual.md
  • nemo_skills/dataset/click/prepare.py

Matches the existing _OPTION_SPLIT / _OPTION_PREFIX pattern in the same
file; no behavior change.

Signed-off-by: bzantium <ryumin93@gmail.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
nemo_skills/dataset/kobalt/prepare.py (2)

57-61: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Precompute all formatted rows before opening the output file.

Formatting and writing are interleaved. If format_entry fails mid-stream, output becomes a truncated JSONL. Build rows first, then write.

💡 Proposed fix
 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")

As per coding guidelines, When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/kobalt/prepare.py` around lines 57 - 61, The
write_data_to_file function interleaves formatting and writing which can leave a
truncated output if format_entry raises; change it to first iterate over data
and build a list (e.g., formatted_rows) by calling format_entry for each entry
(and optionally json.dumps each formatted row with ensure_ascii=False) while
using tqdm for progress, and only after that succeeds open output_file and write
all precomputed rows line-by-line (one JSONL entry per line), referencing
write_data_to_file, format_entry and output_file to locate the change.

38-42: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate malformed question parsing before slicing.

Line 39 assumes _FIRST_OPTION.search(...) always matches; malformed input causes an opaque AttributeError. Also enforce exactly 10 parsed options (A–J) before returning to avoid writing malformed MCQ rows.

💡 Proposed fix
 def _split_question(question_text: str) -> tuple[str, list[str]]:
     first_option = _FIRST_OPTION.search(question_text)
+    if first_option is None:
+        raise ValueError("Could not find the first option marker `A:` in question text.")
+
     stem = question_text[: first_option.start()].rstrip("\n")
     options_block = question_text[first_option.start() + 1 :]
     options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
+    if len(options) != 10:
+        raise ValueError(f"Expected 10 options (A-J), got {len(options)}.")
     return stem, options
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/kobalt/prepare.py` around lines 38 - 42, The code assumes
_FIRST_OPTION.search(question_text) always finds a match and slices using
first_option without validation, causing AttributeError on malformed inputs and
may produce wrong option counts; update the parsing in the function that uses
_FIRST_OPTION, first_option, stem, options_block, _OPTION_PREFIX and
_OPTION_SPLIT to first check that first_option is not None and raise or skip the
record with a clear error message if missing, and after building options
validate that len(options) == 10 (A–J) before returning; if the check fails,
handle it by logging or raising a descriptive exception so malformed MCQs are
not written downstream.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@nemo_skills/dataset/kobalt/prepare.py`:
- Around line 57-61: The write_data_to_file function interleaves formatting and
writing which can leave a truncated output if format_entry raises; change it to
first iterate over data and build a list (e.g., formatted_rows) by calling
format_entry for each entry (and optionally json.dumps each formatted row with
ensure_ascii=False) while using tqdm for progress, and only after that succeeds
open output_file and write all precomputed rows line-by-line (one JSONL entry
per line), referencing write_data_to_file, format_entry and output_file to
locate the change.
- Around line 38-42: The code assumes _FIRST_OPTION.search(question_text) always
finds a match and slices using first_option without validation, causing
AttributeError on malformed inputs and may produce wrong option counts; update
the parsing in the function that uses _FIRST_OPTION, first_option, stem,
options_block, _OPTION_PREFIX and _OPTION_SPLIT to first check that first_option
is not None and raise or skip the record with a clear error message if missing,
and after building options validate that len(options) == 10 (A–J) before
returning; if the check fails, handle it by logging or raising a descriptive
exception so malformed MCQs are not written downstream.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f75257ef-6102-4729-8dec-aa9ee2686121

📥 Commits

Reviewing files that changed from the base of the PR and between 1f23c20 and cf5d6a5.

📒 Files selected for processing (1)
  • nemo_skills/dataset/kobalt/prepare.py

@Kipok Kipok requested review from ekmb and shuoyangd June 9, 2026 17:45
@Kipok

Kipok commented Jun 9, 2026

Collaborator

@shuoyangd @ekmb do you want to have a look and review?

@shuoyangd

Collaborator

> @shuoyangd @ekmb do you want to have a look and review?

Will look at this and #1378 next week
