Add Korean benchmarks: KoBALT and CLIcK by bzantium · Pull Request #1452 · NVIDIA-NeMo/Skills

bzantium · 2026-05-16T08:08:54Z

Summary

Adds two more Korean evaluation benchmarks to the multilingual suite, following the same pattern as the six Korean benchmarks added in #1375.

KoBALT-700 (HF, paper): 700 expert-written 10-choice MCQ items targeting deep Korean linguistic competence across five domains (Syntax 300, Semantics 215, Pragmatics 81, Phonetics/Phonology 62, Morphology 42). The dataset ships each question as one string with the ten options embedded as `A: ... J: ...` lines; the prepare script splits these into the standard MCQ fields so the existing `eval_type=multichoice` evaluator can be reused. A new 10-choice prompt config (`eval/korean/mcq-10choices.yaml`) matches the `정답: A/B/.../J` answer format.
CLIcK (HF, paper, LREC-COLING 2024): 1995 items across 11 paper-defined subcategories under two broad domains (8 Culture: Economy / Geography / History / Law / Politics / Popular / Society / Tradition; 3 Language: Functional / Grammar / Textual). The HF mirror is a single flat split with no subcategory metadata, so the prepare script pulls the per-subcategory JSON files directly from the canonical GitHub repo and tags each item. Mixes 4- and 5-choice MCQ; the existing 5-choice prompt covers both.

Both benchmarks are single-language Korean and reuse the existing `multichoice` evaluator and per-sample `extract_regex` matching `정답: `.
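
To illustrate the answer-extraction step, here is a minimal sketch of pulling the final letter out of a model response. The pattern `ANSWER_RE` and the helper name `extract_answer` are assumptions made for illustration; the exact `extract_regex` string set by the prepare scripts is not shown in this description.

```python
import re
from typing import Optional

# Hypothetical pattern: a letter A-J that follows "정답:" (optional spaces allowed).
ANSWER_RE = re.compile(r"정답\s*:\s*([A-J])")


def extract_answer(response: str) -> Optional[str]:
    """Return the last letter that follows '정답:' in the response, or None."""
    matches = ANSWER_RE.findall(response)
    # Take the last match so that a revised final answer wins over earlier drafts.
    return matches[-1] if matches else None
```

Taking the last match is a design choice of this sketch, on the assumption that reasoning models may restate an answer before the final line.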

Reference numbers

Validated end to end on three production models (`ns eval` on a single 4-GPU node, judge-free MCQ). Numbers also added to `docs/evaluation/multilingual.md`.

KoBALT (symbolic_correct, %)

| Model | Avg | Syntax | Semantics | Pragmatics | Phonetics/Phonology | Morphology |
|---|---|---|---|---|---|---|
| Nemotron-Cascade-2-30B-A3B | 32.29 | 32.00 | 46.05 | 30.86 | 4.84 | 7.14 |
| kanana-2-30b-a3b-thinking-2601 | 36.14 | 33.00 | 49.30 | 28.40 | 19.35 | 30.95 |
| gemma-4-26B-A4B-it | 69.57 | 75.33 | 71.16 | 60.49 | 54.84 | 59.52 |
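
The `Avg` column appears consistent with an item-weighted average over the five domains (weights taken from the domain sizes in the summary) rather than an unweighted mean of the five percentages. This is an observation from the reported numbers, not something the PR states; the helper name below is hypothetical.

```python
# Domain sizes from the PR summary (they sum to 700).
DOMAIN_COUNTS = {
    "Syntax": 300,
    "Semantics": 215,
    "Pragmatics": 81,
    "Phonetics/Phonology": 62,
    "Morphology": 42,
}


def item_weighted_average(per_domain_acc: dict) -> float:
    """Average per-domain accuracies, weighting each domain by its item count."""
    total = sum(DOMAIN_COUNTS.values())
    weighted = sum(DOMAIN_COUNTS[d] * acc for d, acc in per_domain_acc.items())
    return weighted / total


# Per-domain scores copied from the gemma-4-26B-A4B-it row above.
gemma = {
    "Syntax": 75.33,
    "Semantics": 71.16,
    "Pragmatics": 60.49,
    "Phonetics/Phonology": 54.84,
    "Morphology": 59.52,
}
print(round(item_weighted_average(gemma), 2))
```

Because Syntax and Semantics together account for 515 of the 700 items, the average tracks those two domains much more closely than the smaller Phonetics/Phonology and Morphology sets.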

CLIcK (symbolic_correct, %)

| Model | Overall |
|---|---|
| Nemotron-Cascade-2-30B-A3B | 65.81 |
| kanana-2-30b-a3b-thinking-2601 | 73.23 |
| gemma-4-26B-A4B-it | 86.77 |

Summary by CodeRabbit

New Features
- Added two Korean MCQ evaluation benchmarks: kobalt (10-choice) and click (5-choice), including prepared test data and evaluation configs.
Documentation
- Multilingual evaluation docs updated: Korean MCQ list expanded to six; clarified per-option answer extraction behavior, kobalt’s 10-choice prompt usage, and that these benchmarks are single-language (no --languages flag).

KoBALT-700 (snunlp/KoBALT-700, arXiv:2505.16125): expert-written linguistic competence benchmark with 700 ten-option MCQs across five domains (Syntax, Semantics, Pragmatics, Phonetics/Phonology, Morphology). The prepare script splits the embedded option lines (A through J) out of the single 'Question' field so the standard MCQ template can render them; a new 10-choice prompt config matches the per-sample extract_regex. CLIcK (EunsuKim/CLIcK, LREC-COLING 2024): Korean cultural and linguistic intelligence benchmark with 1995 items across 11 subcategories under two broad domains (Culture, Language). The HF mirror has no subcategory metadata, so the prepare script pulls the per-subcategory JSON files from the canonical GitHub repo (rladmstn1714/CLIcK) and tags each item with its paper-defined subcategory. Mixes 4- and 5-option MCQ, so the existing 5-choice prompt is reused; text answers are mapped back to the corresponding letter. Signed-off-by: bzantium <ryumin93@gmail.com>

…Korean prepare scripts Signed-off-by: bzantium <ryumin93@gmail.com>

Adds per-benchmark sections with reference numbers (Nemotron-Cascade-2, kanana-2 thinking, gemma-4-26B) and example commands. Updates the closing paragraph to reflect six Korean MCQ benchmarks total. Signed-off-by: bzantium <ryumin93@gmail.com>

coderabbitai · 2026-05-16T08:11:36Z

📝 Walkthrough


Adds KoBALT (10-choice) and CLIcK (5-choice) Korean MCQ benchmarks: dataset prepare scripts that emit JSONL, per-dataset evaluation config constants, a 10-choice Korean MCQ prompt template, and documentation updates with reference results and eval command examples.

Changes

Korean MCQ Benchmarks: KoBALT and CLIcK

| Layer / File(s) | Summary |
|---|---|
| KoBALT dataset preparation and configuration: `nemo_skills/dataset/kobalt/__init__.py`, `nemo_skills/dataset/kobalt/prepare.py` | KoBALT loads `snunlp/KoBALT-700`, parses combined questions into stem and 10 options (A–J) via regex, maps expected answers, sets Korean `extract_regex` and fixed flags, writes JSONL, and exports `METRICS_TYPE` plus a 10-choice `GENERATION_ARGS` prompt override. |
| CLIcK dataset preparation and configuration: `nemo_skills/dataset/click/__init__.py`, `nemo_skills/dataset/click/prepare.py` | CLIcK loads `bzantium/CLIcK` (test), maps correct-answer text to A–E, optionally prepends `paragraph` to `question`, sets `extract_regex` and `subset_for_metrics`, writes JSONL, and exports `METRICS_TYPE` with default 5-choice `GENERATION_ARGS`. |
| 10-choice prompt template and documentation: `nemo_skills/prompt/config/eval/korean/mcq-10choices.yaml`, `docs/evaluation/multilingual.md` | Adds a Korean 10-choice MCQ prompt template requiring final `정답: A/.../J` line. Updates evaluation docs with `kobalt` and `click` benchmark sections (sources, counts, reference tables, `ns eval` examples) and expands the Korean MCQ summary to six benchmarks. |
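
As a rough illustration of the CLIcK row described above, the sketch below maps the correct-answer text to a letter, prepends `paragraph` when present, and tags the row for per-subcategory metrics. The input field names `choices` and `answer` and the output key `expected_answer` are assumptions; only `paragraph`, `question`, `subcategory`, and `subset_for_metrics` are named in this PR, and the real `format_entry` may differ.

```python
LETTERS = "ABCDE"


def format_entry(entry: dict) -> dict:
    """Convert one CLIcK item into a flat MCQ row (illustrative sketch only)."""
    choices = entry["choices"]  # assumed: list of 4 or 5 option strings
    if not 4 <= len(choices) <= 5:
        raise ValueError(f"Expected 4 or 5 choices, got {len(choices)}.")

    # Assumed: the gold answer is stored as option text, so look up its position.
    # list.index raises ValueError if the answer text is not among the choices.
    answer_letter = LETTERS[choices.index(entry["answer"])]

    question = entry["question"]
    if entry.get("paragraph"):
        # Prepend the optional reading passage to the question stem.
        question = f"{entry['paragraph']}\n\n{question}"

    row = {
        "question": question,
        "expected_answer": answer_letter,
        "subset_for_metrics": entry["subcategory"],
    }
    # Emit one key per option letter; 4-choice items simply have no "E".
    for letter, choice in zip(LETTERS, choices):
        row[letter] = choice
    return row
```

Looking the answer up by text fails loudly when the gold answer is missing from the options, which keeps malformed items out of the prepared JSONL.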

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run GPU tests

Suggested reviewers

naymaraq

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: adding two new Korean benchmarks (KoBALT and CLIcK) to the evaluation suite. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

nemo_skills/dataset/click/prepare.py (1)

34-35: ⚡ Quick win

Pin CLIcK source to a commit/tag for reproducibility.

Using main in raw GitHub URLs makes prepared data mutable over time. Pin to a commit SHA (or immutable tag) so benchmark rows and reference numbers stay stable.

💡 Proposed fix

-_GITHUB_BASE = "https://raw.githubusercontent.com/rladmstn1714/CLIcK/main/Dataset"
+_GITHUB_BASE = "https://raw.githubusercontent.com/rladmstn1714/CLIcK/<pinned-commit-sha>/Dataset"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/click/prepare.py` around lines 34 - 35, The URL constant
_GITHUB_BASE currently points to the raw GitHub path using "main", which makes
downloaded dataset files mutable; update _GITHUB_BASE to reference a specific
immutable commit SHA or tag (e.g., replace "main" with the commit hash) so
prepared data is reproducible, and optionally allow overriding via an
environment variable or new constant to keep flexibility while defaulting to the
pinned SHA; locate and update the _GITHUB_BASE definition in prepare.py (and any
callers that build URLs from _SUBDIR) so all subsequent raw file requests use
the pinned reference.
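
One way to combine a pinned reference with the optional override mentioned above is sketched here. The environment variable name `CLICK_GIT_REF` and the helper `github_base` are hypothetical; the fallback stays at `main` only because no pinned SHA is given in this review.

```python
import os
from typing import Optional

_REPO_RAW = "https://raw.githubusercontent.com/rladmstn1714/CLIcK"


def github_base(ref: Optional[str] = None) -> str:
    """Build the raw-file base URL for a given git ref (commit SHA, tag, or branch)."""
    # Precedence: explicit argument, then the hypothetical CLICK_GIT_REF
    # environment variable, then "main" (the mutable default the review warns about).
    ref = ref or os.environ.get("CLICK_GIT_REF", "main")
    return f"{_REPO_RAW}/{ref}/Dataset"
```

In practice the fallback would be replaced by a fixed commit SHA so that the prepared data, and the reference numbers computed from it, stay stable.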

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/click/prepare.py`:
- Around line 103-107: The write_data_to_file function currently formats entries
while streaming them to disk, which can leave a partially written file if
formatting fails; change it to first precompute a list (or generator fully
realized) of formatted entries by calling format_entry for every entry (e.g.,
build formatted = [format_entry(e) for e in data] using tqdm for progress if
desired), and only after that opens output_file for writing and dumps each
preformatted item as JSONL; keep the same json.dump/fout.write("\n") logic but
ensure all formatting completes before the file is opened.

In `@nemo_skills/dataset/kobalt/prepare.py`:
- Around line 56-60: The current write_data_to_file interleaves formatting
(format_entry) and file writes so a formatting error can leave a truncated
JSONL; change write_data_to_file to first iterate over data and produce all
formatted JSON strings (or Python objects) into an in-memory list using
format_entry/json.dumps, and only after that successfully completes open
output_file and write the prepared lines with fout.write(... + "\n") (keep tqdm
desc using output_file.name); reference the write_data_to_file function and
format_entry when making this change.
- Around line 36-41: In _split_question, add explicit validation: check the
re.search result stored in first_option and if it's None raise a ValueError
stating the item is malformed/missing the "A:" marker and include the original
question_text (or a short excerpt) for debugging; after building options using
_OPTION_SPLIT and stripping with _OPTION_PREFIX, verify len(options) == 10 and
if not raise a ValueError that includes the parsed options count and the options
list (or excerpt) so the caller gets a clear error instead of silent malformed
MCQ rows.

---

Nitpick comments:
In `@nemo_skills/dataset/click/prepare.py`:
- Around line 34-35: The URL constant _GITHUB_BASE currently points to the raw
GitHub path using "main", which makes downloaded dataset files mutable; update
_GITHUB_BASE to reference a specific immutable commit SHA or tag (e.g., replace
"main" with the commit hash) so prepared data is reproducible, and optionally
allow overriding via an environment variable or new constant to keep flexibility
while defaulting to the pinned SHA; locate and update the _GITHUB_BASE
definition in prepare.py (and any callers that build URLs from _SUBDIR) so all
subsequent raw file requests use the pinned reference.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b93c0a8-b51e-4a92-b473-7db5d3354a12

📥 Commits

Reviewing files that changed from the base of the PR and between 98a3cf3 and 1ea7c11.

📒 Files selected for processing (6)

docs/evaluation/multilingual.md
nemo_skills/dataset/click/__init__.py
nemo_skills/dataset/click/prepare.py
nemo_skills/dataset/kobalt/__init__.py
nemo_skills/dataset/kobalt/prepare.py
nemo_skills/prompt/config/eval/korean/mcq-10choices.yaml

coderabbitai · 2026-05-16T08:11:39Z

+def write_data_to_file(output_file, data):
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
+            json.dump(format_entry(entry), fout, ensure_ascii=False)
+            fout.write("\n")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Precompute formatted entries before writing output JSONL.

Current flow can leave partially written files if any entry fails during formatting. Format first, write second.

As per coding guidelines, When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.

💡 Proposed fix

 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/click/prepare.py` around lines 103 - 107, The write_data_to_file function currently formats entries while streaming them to disk, which can leave a partially written file if formatting fails; change it to first precompute a list (or generator fully realized) of formatted entries by calling format_entry for every entry (e.g., build formatted = [format_entry(e) for e in data] using tqdm for progress if desired), and only after that opens output_file for writing and dumps each preformatted item as JSONL; keep the same json.dump/fout.write("\n") logic but ensure all formatting completes before the file is opened.
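
The format-first, write-second pattern can be checked in isolation with a small self-contained sketch. The `format_entry` below is a stand-in that fails on rows missing a `question` key, not the real formatter, and `tqdm` is omitted to keep the example dependency-free.

```python
import json
import tempfile
from pathlib import Path


def format_entry(entry: dict) -> dict:
    # Stand-in formatter: raises KeyError on a malformed row.
    return {"question": entry["question"]}


def write_data_to_file(output_file: Path, data) -> None:
    # Format every row first; any exception surfaces before the file is opened.
    rows = [format_entry(entry) for entry in data]
    with open(output_file, "wt", encoding="utf-8") as fout:
        for row in rows:
            json.dump(row, fout, ensure_ascii=False)
            fout.write("\n")


# Demo: the second row is malformed, so formatting fails before any file is created.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "test.jsonl"
    try:
        write_data_to_file(out, [{"question": "ok"}, {"bad": "row"}])
    except KeyError:
        pass
    file_was_created = out.exists()
```

With the interleaved version, the same input would leave a one-line `test.jsonl` behind; here `file_was_created` stays false because the file is never opened.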

coderabbitai · 2026-05-16T08:11:39Z

+def _split_question(question_text: str) -> tuple[str, list[str]]:
+    first_option = re.search(r"\nA[:.\)]\s", question_text)
+    stem = question_text[: first_option.start()].rstrip("\n")
+    options_block = question_text[first_option.start() + 1 :]
+    options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
+    return stem, options


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate KoBALT option parsing explicitly.

If an item is malformed (missing A: marker or parsed options count not equal to 10), this currently fails with an opaque error or emits malformed MCQ rows. Add explicit validation and fail with a clear ValueError.

💡 Proposed fix

 def _split_question(question_text: str) -> tuple[str, list[str]]:
     first_option = re.search(r"\nA[:.\)]\s", question_text)
+    if first_option is None:
+        raise ValueError("Could not find the first option marker `A:` in question text.")
+
     stem = question_text[: first_option.start()].rstrip("\n")
     options_block = question_text[first_option.start() + 1 :]
     options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
+    if len(options) != 10:
+        raise ValueError(f"Expected 10 options (A-J), got {len(options)}.")
     return stem, options

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/kobalt/prepare.py` around lines 36 - 41, In _split_question, add explicit validation: check the re.search result stored in first_option and if it's None raise a ValueError stating the item is malformed/missing the "A:" marker and include the original question_text (or a short excerpt) for debugging; after building options using _OPTION_SPLIT and stripping with _OPTION_PREFIX, verify len(options) == 10 and if not raise a ValueError that includes the parsed options count and the options list (or excerpt) so the caller gets a clear error instead of silent malformed MCQ rows.

coderabbitai · 2026-05-16T08:11:40Z

+def write_data_to_file(output_file, data):
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
+            json.dump(format_entry(entry), fout, ensure_ascii=False)
+            fout.write("\n")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid partial JSONL writes on formatting failures.

Formatting and writing are interleaved, so a single bad sample can leave a truncated output file. Compute all formatted rows first, then open and write the file.

As per coding guidelines, When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.

💡 Proposed fix

 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/kobalt/prepare.py` around lines 56 - 60, The current write_data_to_file interleaves formatting (format_entry) and file writes so a formatting error can leave a truncated JSONL; change write_data_to_file to first iterate over data and produce all formatted JSON strings (or Python objects) into an in-memory list using format_entry/json.dumps, and only after that successfully completes open output_file and write the prepared lines with fout.write(... + "\n") (keep tqdm desc using output_file.name); reference the write_data_to_file function and format_entry when making this change.

bzantium · 2026-05-16T08:24:25Z

For context, here is how widely these two benchmarks are used in the field. Both show up across the major Korean LLM technical reports and recent cross-lingual evaluation frameworks, so adding them to the Skills suite extends coverage to evals that maintainers of widely deployed Korean LLMs already report.

CLIcK (arXiv:2403.06412, LREC-COLING 2024)

50+ citing papers on Semantic Scholar. Selected:

Korean LLM technical reports

HyperCLOVA X 32B Think (arXiv:2601.03286)
HyperCLOVA X THINK Technical Report (arXiv:2506.22403)
Solar Open Technical Report (arXiv:2601.07022)
Mi:dm K 2.5 Pro (arXiv:2603.18788)
Mi:dm 2.0 Korea-centric Bilingual Language Models (arXiv:2601.09066)
K-EXAONE Technical Report (arXiv:2601.01739)

Evaluation / benchmark frameworks

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation (arXiv:2506.00482)
From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite (arXiv:2507.08924, the same paper behind the kmmlu-pro / kmmlu-redux benchmarks that landed in #1375, "Add Korean benchmarks: KMMLU, KMMLU-Pro, KMMLU-Redux, KoIFEval, KorMedMCQA, KoSimpleQA")
Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of LLMs (arXiv:2503.22968)
KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in LLMs (arXiv:2510.15558)
Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean (arXiv:2506.01237)

KoBALT (arXiv:2505.16125)

Newer (May 2025) but already picked up by:

EXAONE 4.5 Technical Report (arXiv:2604.08644)
Mi:dm K 2.5 Pro (arXiv:2603.18788)
Solar Open Technical Report (arXiv:2601.07022)
Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought (arXiv:2510.04230)

Both fit the same role as the six Korean benchmarks added in #1375: standard Korean evals that the major Korean LLM teams (HyperCLOVA, Solar, Mi:dm, EXAONE, K-EXAONE) already report.

The upstream EunsuKim/CLIcK HF mirror drops the per-item subcategory labels (only the GitHub repo's directory layout carries them), so the previous prepare script had to pull each per-subcategory JSON file directly over HTTP. bzantium/CLIcK is a clean HF mirror that restores those labels (subcategory + broad_category fields), letting prepare.py just call load_dataset() like every other Korean benchmark on this branch. Signed-off-by: bzantium <ryumin93@gmail.com>

coderabbitai

♻️ Duplicate comments (1)

nemo_skills/dataset/click/prepare.py (1)

54-58: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Precompute formatted entries before writing output JSONL.

Current flow can leave partially written files if any entry fails during formatting. Format first, write second. This issue was flagged in a previous review and remains unaddressed.

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

💡 Proposed fix

 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/click/prepare.py` around lines 54 - 58, The function
write_data_to_file currently formats each entry while the output file is open,
risking partially-written files if format_entry raises; change it to first fully
precompute/validate all formatted entries (call format_entry for each entry and
collect results, allowing exceptions to surface before any file I/O), then open
output_file and write the preformatted JSONL lines; reference the
write_data_to_file function and the format_entry call so you precompute entries
into a list (or other container) and only after successful formatting perform
the open/write loop to avoid partial files.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@nemo_skills/dataset/click/prepare.py`:
- Around line 54-58: The function write_data_to_file currently formats each
entry while the output file is open, risking partially-written files if
format_entry raises; change it to first fully precompute/validate all formatted
entries (call format_entry for each entry and collect results, allowing
exceptions to surface before any file I/O), then open output_file and write the
preformatted JSONL lines; reference the write_data_to_file function and the
format_entry call so you precompute entries into a list (or other container) and
only after successful formatting perform the open/write loop to avoid partial
files.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 894d0819-88a4-4e57-99f4-93c5a2adf141

📥 Commits

Reviewing files that changed from the base of the PR and between 1ea7c11 and 1f23c20.

📒 Files selected for processing (2)

docs/evaluation/multilingual.md
nemo_skills/dataset/click/prepare.py

Matches the existing _OPTION_SPLIT / _OPTION_PREFIX pattern in the same file; no behavior change. Signed-off-by: bzantium <ryumin93@gmail.com>

coderabbitai

♻️ Duplicate comments (2)

nemo_skills/dataset/kobalt/prepare.py (2)

57-61: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Precompute all formatted rows before opening the output file.

Formatting and writing are interleaved. If format_entry fails mid-stream, output becomes a truncated JSONL. Build rows first, then write.

💡 Proposed fix

 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")

As per coding guidelines, When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/kobalt/prepare.py` around lines 57 - 61, The
write_data_to_file function interleaves formatting and writing which can leave a
truncated output if format_entry raises; change it to first iterate over data
and build a list (e.g., formatted_rows) by calling format_entry for each entry
(and optionally json.dumps each formatted row with ensure_ascii=False) while
using tqdm for progress, and only after that succeeds open output_file and write
all precomputed rows line-by-line (one JSONL entry per line), referencing
write_data_to_file, format_entry and output_file to locate the change.

38-42: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate malformed question parsing before slicing.

Line 39 assumes _FIRST_OPTION.search(...) always matches; malformed input causes an opaque AttributeError. Also enforce exactly 10 parsed options (A–J) before returning to avoid writing malformed MCQ rows.

💡 Proposed fix

 def _split_question(question_text: str) -> tuple[str, list[str]]:
     first_option = _FIRST_OPTION.search(question_text)
+    if first_option is None:
+        raise ValueError("Could not find the first option marker `A:` in question text.")
+
     stem = question_text[: first_option.start()].rstrip("\n")
     options_block = question_text[first_option.start() + 1 :]
     options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
+    if len(options) != 10:
+        raise ValueError(f"Expected 10 options (A-J), got {len(options)}.")
     return stem, options

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/kobalt/prepare.py` around lines 38 - 42, The code assumes
_FIRST_OPTION.search(question_text) always finds a match and slices using
first_option without validation, causing AttributeError on malformed inputs and
may produce wrong option counts; update the parsing in the function that uses
_FIRST_OPTION, first_option, stem, options_block, _OPTION_PREFIX and
_OPTION_SPLIT to first check that first_option is not None and raise or skip the
record with a clear error message if missing, and after building options
validate that len(options) == 10 (A–J) before returning; if the check fails,
handle it by logging or raising a descriptive exception so malformed MCQs are
not written downstream.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@nemo_skills/dataset/kobalt/prepare.py`:
- Around line 57-61: The write_data_to_file function interleaves formatting and
writing which can leave a truncated output if format_entry raises; change it to
first iterate over data and build a list (e.g., formatted_rows) by calling
format_entry for each entry (and optionally json.dumps each formatted row with
ensure_ascii=False) while using tqdm for progress, and only after that succeeds
open output_file and write all precomputed rows line-by-line (one JSONL entry
per line), referencing write_data_to_file, format_entry and output_file to
locate the change.
- Around line 38-42: The code assumes _FIRST_OPTION.search(question_text) always
finds a match and slices using first_option without validation, causing
AttributeError on malformed inputs and may produce wrong option counts; update
the parsing in the function that uses _FIRST_OPTION, first_option, stem,
options_block, _OPTION_PREFIX and _OPTION_SPLIT to first check that first_option
is not None and raise or skip the record with a clear error message if missing,
and after building options validate that len(options) == 10 (A–J) before
returning; if the check fails, handle it by logging or raising a descriptive
exception so malformed MCQs are not written downstream.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f75257ef-6102-4729-8dec-aa9ee2686121

📥 Commits

Reviewing files that changed from the base of the PR and between 1f23c20 and cf5d6a5.

📒 Files selected for processing (1)

nemo_skills/dataset/kobalt/prepare.py

Kipok · 2026-06-09T17:45:24Z

@shuoyangd @ekmb do you want to have a look and review?

shuoyangd · 2026-06-10T21:49:24Z

> @shuoyangd @ekmb do you want to have a look and review?

Will look at this and #1378 next week

bzantium added 3 commits May 16, 2026 14:51

refactor(click): use write_data_to_file helper for parity with other …

0eda020

…Korean prepare scripts Signed-off-by: bzantium <ryumin93@gmail.com>


refactor(kobalt): precompile _FIRST_OPTION regex at module level

cf5d6a5

Matches the existing _OPTION_SPLIT / _OPTION_PREFIX pattern in the same file; no behavior change. Signed-off-by: bzantium <ryumin93@gmail.com>


Kipok requested review from ekmb and shuoyangd June 9, 2026 17:45
