Add Korean benchmarks: KoBALT and CLIcK #1452
KoBALT-700 (snunlp/KoBALT-700, arXiv:2505.16125): expert-written linguistic competence benchmark with 700 ten-option MCQs across five domains (Syntax, Semantics, Pragmatics, Phonetics/Phonology, Morphology). The prepare script splits the embedded option lines (A through J) out of the single 'Question' field so the standard MCQ template can render them; a new 10-choice prompt config matches the per-sample extract_regex.
CLIcK (EunsuKim/CLIcK, LREC-COLING 2024): Korean cultural and linguistic intelligence benchmark with 1995 items across 11 subcategories under two broad domains (Culture, Language). The HF mirror has no subcategory metadata, so the prepare script pulls the per-subcategory JSON files from the canonical GitHub repo (rladmstn1714/CLIcK) and tags each item with its paper-defined subcategory. The benchmark mixes 4- and 5-option MCQs, so the existing 5-choice prompt is reused; text answers are mapped back to the corresponding letter.
Signed-off-by: bzantium <ryumin93@gmail.com>
…Korean prepare scripts Signed-off-by: bzantium <ryumin93@gmail.com>
Adds per-benchmark sections with reference numbers (Nemotron-Cascade-2, kanana-2 thinking, gemma-4-26B) and example commands. Updates the closing paragraph to reflect six Korean MCQ benchmarks total. Signed-off-by: bzantium <ryumin93@gmail.com>
📝 Walkthrough
Adds KoBALT (10-choice) and CLIcK (5-choice) Korean MCQ benchmarks: dataset prepare scripts that emit JSONL, per-dataset evaluation config constants, a 10-choice Korean MCQ prompt template, and documentation updates with reference results and eval command examples.
Changes
Korean MCQ Benchmarks: KoBALT and CLIcK
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (1)
nemo_skills/dataset/click/prepare.py (1)
34-35: ⚡ Quick win
Pin CLIcK source to a commit/tag for reproducibility.
Using `main` in raw GitHub URLs makes prepared data mutable over time. Pin to a commit SHA (or immutable tag) so benchmark rows and reference numbers stay stable.
💡 Proposed fix
-_GITHUB_BASE = "https://raw.githubusercontent.com/rladmstn1714/CLIcK/main/Dataset"
+_GITHUB_BASE = "https://raw.githubusercontent.com/rladmstn1714/CLIcK/<pinned-commit-sha>/Dataset"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/click/prepare.py` around lines 34 - 35, The URL constant _GITHUB_BASE currently points to the raw GitHub path using "main", which makes downloaded dataset files mutable; update _GITHUB_BASE to reference a specific immutable commit SHA or tag (e.g., replace "main" with the commit hash) so prepared data is reproducible, and optionally allow overriding via an environment variable or new constant to keep flexibility while defaulting to the pinned SHA; locate and update the _GITHUB_BASE definition in prepare.py (and any callers that build URLs from _SUBDIR) so all subsequent raw file requests use the pinned reference.
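The pinned-reference idea with an optional override can be sketched as follows. This is only an illustration: the environment variable name `CLICK_GIT_REF` and the helper `github_base` are hypothetical, and `<pinned-commit-sha>` is the same unfilled placeholder used in the proposed fix, not a real commit.

```python
import os
from typing import Optional

# Placeholder from the review suggestion; a real commit SHA would go here.
_DEFAULT_REF = "<pinned-commit-sha>"


def github_base(ref: Optional[str] = None) -> str:
    """Build the raw GitHub base URL for the CLIcK dataset directory.

    Resolution order: explicit argument, then the (hypothetical)
    CLICK_GIT_REF environment variable, then the pinned default.
    """
    ref = ref or os.environ.get("CLICK_GIT_REF", _DEFAULT_REF)
    return f"https://raw.githubusercontent.com/rladmstn1714/CLIcK/{ref}/Dataset"
```

Defaulting to the pinned reference keeps prepared data stable, while the override leaves room to test against a newer upstream revision deliberately.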
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7b93c0a8-b51e-4a92-b473-7db5d3354a12
📒 Files selected for processing (6)
docs/evaluation/multilingual.md
nemo_skills/dataset/click/__init__.py
nemo_skills/dataset/click/prepare.py
nemo_skills/dataset/kobalt/__init__.py
nemo_skills/dataset/kobalt/prepare.py
nemo_skills/prompt/config/eval/korean/mcq-10choices.yaml
def write_data_to_file(output_file, data):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
            json.dump(format_entry(entry), fout, ensure_ascii=False)
            fout.write("\n")
Precompute formatted entries before writing output JSONL.
Current flow can leave partially written files if any entry fails during formatting. Format first, write second.
As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
💡 Proposed fix
 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/click/prepare.py` around lines 103 - 107, The
write_data_to_file function currently formats entries while streaming them to
disk, which can leave a partially written file if formatting fails; change it to
first precompute a list (or generator fully realized) of formatted entries by
calling format_entry for every entry (e.g., build formatted = [format_entry(e)
for e in data] using tqdm for progress if desired), and only after that opens
output_file for writing and dumps each preformatted item as JSONL; keep the same
json.dump/fout.write("\n") logic but ensure all formatting completes before the
file is opened.
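The format-first, write-second pattern can be sketched in a self-contained form. The `format_entry` below is a hypothetical stand-in for the benchmark-specific formatter, and the temp-file-plus-`os.replace` step is an extra safeguard that goes beyond what the review asks for; it is shown only as one possible way to also protect a pre-existing output file.

```python
import json
import os
import tempfile
from pathlib import Path


def format_entry(entry: dict) -> dict:
    # Hypothetical stand-in for the real formatter in prepare.py.
    return {"question": entry["question"], "expected_answer": entry["answer"]}


def write_data_to_file(output_file: Path, data: list) -> None:
    # Step 1: format everything up front, so a bad entry raises before
    # the output path is touched.
    rows = [format_entry(entry) for entry in data]

    # Step 2 (extra safeguard, not part of the review suggestion): write to a
    # temp file in the same directory, then atomically move it into place.
    output_file = Path(output_file)
    fd, tmp_name = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wt", encoding="utf-8") as fout:
            for row in rows:
                json.dump(row, fout, ensure_ascii=False)
                fout.write("\n")
        os.replace(tmp_name, output_file)
    except BaseException:
        # Clean up the temp file and re-raise the original error.
        os.unlink(tmp_name)
        raise
```

With this shape, a formatting failure leaves no output file behind, and a failure during the write leaves any earlier output file untouched.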
def _split_question(question_text: str) -> tuple[str, list[str]]:
    first_option = re.search(r"\nA[:.\)]\s", question_text)
    stem = question_text[: first_option.start()].rstrip("\n")
    options_block = question_text[first_option.start() + 1 :]
    options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
    return stem, options
Validate KoBALT option parsing explicitly.
If an item is malformed (missing A: marker or parsed options count not equal to 10), this currently fails with an opaque error or emits malformed MCQ rows. Add explicit validation and fail with a clear ValueError.
💡 Proposed fix
 def _split_question(question_text: str) -> tuple[str, list[str]]:
     first_option = re.search(r"\nA[:.\)]\s", question_text)
+    if first_option is None:
+        raise ValueError("Could not find the first option marker `A:` in question text.")
+
     stem = question_text[: first_option.start()].rstrip("\n")
     options_block = question_text[first_option.start() + 1 :]
     options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
+    if len(options) != 10:
+        raise ValueError(f"Expected 10 options (A-J), got {len(options)}.")
     return stem, options
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/kobalt/prepare.py` around lines 36 - 41, In
_split_question, add explicit validation: check the re.search result stored in
first_option and if it's None raise a ValueError stating the item is
malformed/missing the "A:" marker and include the original question_text (or a
short excerpt) for debugging; after building options using _OPTION_SPLIT and
stripping with _OPTION_PREFIX, verify len(options) == 10 and if not raise a
ValueError that includes the parsed options count and the options list (or
excerpt) so the caller gets a clear error instead of silent malformed MCQ rows.
def write_data_to_file(output_file, data):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
            json.dump(format_entry(entry), fout, ensure_ascii=False)
            fout.write("\n")
Avoid partial JSONL writes on formatting failures.
Formatting and writing are interleaved, so a single bad sample can leave a truncated output file. Compute all formatted rows first, then open and write the file.
As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
💡 Proposed fix
 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/dataset/kobalt/prepare.py` around lines 56 - 60, The current
write_data_to_file interleaves formatting (format_entry) and file writes so a
formatting error can leave a truncated JSONL; change write_data_to_file to first
iterate over data and produce all formatted JSON strings (or Python objects)
into an in-memory list using format_entry/json.dumps, and only after that
successfully completes open output_file and write the prepared lines with
fout.write(... + "\n") (keep tqdm desc using output_file.name); reference the
write_data_to_file function and format_entry when making this change.
For context on how widely these two benchmarks are used in the field: both show up across the major Korean LLM technical reports and recent cross-lingual evaluation frameworks, so adding them to the Skills suite extends coverage to evals that maintainers of widely deployed Korean LLMs already report.
CLIcK (arXiv:2403.06412, LREC-COLING 2024): 50+ citing papers on Semantic Scholar. Selected:
Korean LLM technical reports
Evaluation / benchmark frameworks
KoBALT (arXiv:2505.16125): newer (May 2025) but already picked up by:
Both fit the same role as the six Korean benchmarks added in #1375: standard Korean evals that the major Korean LLM teams (HyperCLOVA, Solar, Mi:dm, EXAONE, K-EXAONE) already report.
The upstream EunsuKim/CLIcK HF mirror drops the per-item subcategory labels (only the GitHub repo's directory layout carries them), so the previous prepare script had to pull each per-subcategory JSON file directly over HTTP. bzantium/CLIcK is a clean HF mirror that restores those labels (subcategory + broad_category fields), letting prepare.py just call load_dataset() like every other Korean benchmark on this branch. Signed-off-by: bzantium <ryumin93@gmail.com>
♻️ Duplicate comments (1)
nemo_skills/dataset/click/prepare.py (1)
54-58: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Precompute formatted entries before writing output JSONL.
Current flow can leave partially written files if any entry fails during formatting. Format first, write second. This issue was flagged in a previous review and remains unaddressed.
As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
💡 Proposed fix
 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/click/prepare.py` around lines 54 - 58, The function write_data_to_file currently formats each entry while the output file is open, risking partially-written files if format_entry raises; change it to first fully precompute/validate all formatted entries (call format_entry for each entry and collect results, allowing exceptions to surface before any file I/O), then open output_file and write the preformatted JSONL lines; reference the write_data_to_file function and the format_entry call so you precompute entries into a list (or other container) and only after successful formatting perform the open/write loop to avoid partial files.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 894d0819-88a4-4e57-99f4-93c5a2adf141
📒 Files selected for processing (2)
docs/evaluation/multilingual.md
nemo_skills/dataset/click/prepare.py
Matches the existing _OPTION_SPLIT / _OPTION_PREFIX pattern in the same file; no behavior change. Signed-off-by: bzantium <ryumin93@gmail.com>
♻️ Duplicate comments (2)
nemo_skills/dataset/kobalt/prepare.py (2)
57-61: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Precompute all formatted rows before opening the output file.
Formatting and writing are interleaved. If `format_entry` fails mid-stream, output becomes a truncated JSONL. Build rows first, then write.
💡 Proposed fix
 def write_data_to_file(output_file, data):
+    rows = [format_entry(entry) for entry in tqdm(data, desc="Formatting entries")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for row in tqdm(rows, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
             fout.write("\n")
As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/kobalt/prepare.py` around lines 57 - 61, The write_data_to_file function interleaves formatting and writing which can leave a truncated output if format_entry raises; change it to first iterate over data and build a list (e.g., formatted_rows) by calling format_entry for each entry (and optionally json.dumps each formatted row with ensure_ascii=False) while using tqdm for progress, and only after that succeeds open output_file and write all precomputed rows line-by-line (one JSONL entry per line), referencing write_data_to_file, format_entry and output_file to locate the change.
38-42: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Validate malformed question parsing before slicing.
Line 39 assumes `_FIRST_OPTION.search(...)` always matches; malformed input causes an opaque `AttributeError`. Also enforce exactly 10 parsed options (A–J) before returning to avoid writing malformed MCQ rows.
💡 Proposed fix
 def _split_question(question_text: str) -> tuple[str, list[str]]:
     first_option = _FIRST_OPTION.search(question_text)
+    if first_option is None:
+        raise ValueError("Could not find the first option marker `A:` in question text.")
+
     stem = question_text[: first_option.start()].rstrip("\n")
     options_block = question_text[first_option.start() + 1 :]
     options = [_OPTION_PREFIX.sub("", o).strip() for o in _OPTION_SPLIT.split(options_block)]
+    if len(options) != 10:
+        raise ValueError(f"Expected 10 options (A-J), got {len(options)}.")
     return stem, options
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/kobalt/prepare.py` around lines 38 - 42, The code assumes _FIRST_OPTION.search(question_text) always finds a match and slices using first_option without validation, causing AttributeError on malformed inputs and may produce wrong option counts; update the parsing in the function that uses _FIRST_OPTION, first_option, stem, options_block, _OPTION_PREFIX and _OPTION_SPLIT to first check that first_option is not None and raise or skip the record with a clear error message if missing, and after building options validate that len(options) == 10 (A–J) before returning; if the check fails, handle it by logging or raising a descriptive exception so malformed MCQs are not written downstream.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f75257ef-6102-4729-8dec-aa9ee2686121
📒 Files selected for processing (1)
nemo_skills/dataset/kobalt/prepare.py
@shuoyangd @ekmb do you want to have a look and review?
Will look at this and #1378 next week
Summary
Adds two more Korean evaluation benchmarks to the multilingual suite, following the same pattern as the six Korean benchmarks added in #1375.
KoBALT-700 (HF, paper): 700 expert-written 10-choice MCQ items targeting deep Korean linguistic competence across five domains (Syntax 300, Semantics 215, Pragmatics 81, Phonetics/Phonology 62, Morphology 42). The dataset ships each question as one string with the ten options embedded as `A: ... J: ...` lines; the prepare script splits these into the standard MCQ fields so the existing `eval_type=multichoice` evaluator can be reused. A new 10-choice prompt config (`eval/korean/mcq-10choices.yaml`) matches the `정답: A/B/.../J` answer format.
CLIcK (HF, paper, LREC-COLING 2024): 1995 items across 11 paper-defined subcategories under two broad domains (8 Culture: Economy / Geography / History / Law / Politics / Popular / Society / Tradition; 3 Language: Functional / Grammar / Textual). The HF mirror is a single flat split with no subcategory metadata, so the prepare script pulls the per-subcategory JSON files directly from the canonical GitHub repo and tags each item. Mixes 4- and 5-choice MCQ; the existing 5-choice prompt covers both.
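Mapping a CLIcK text answer back to its option letter, as described for the mixed 4- and 5-choice items, might look like the sketch below. The function name `answer_to_letter` and the assumption that each item carries a list of choice strings plus the answer as one of those strings are illustrative; the actual field names in the prepare script are not shown here.

```python
def answer_to_letter(choices: list, answer: str) -> str:
    """Return the letter (A-E) of the choice whose text equals the answer."""
    letters = "ABCDE"
    if not 4 <= len(choices) <= 5:
        raise ValueError(f"Expected 4 or 5 choices, got {len(choices)}.")

    # Compare on stripped text so stray whitespace does not break the match.
    stripped = [choice.strip() for choice in choices]
    target = answer.strip()
    if target not in stripped:
        raise ValueError(f"Answer {answer!r} does not match any choice: {choices!r}")
    return letters[stripped.index(target)]
```

Because a 4-option item only ever maps to A through D, the same 5-choice prompt and answer regex can serve both item shapes.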
Both benchmarks are single-language Korean and reuse the existing `multichoice` evaluator and per-sample `extract_regex` matching `정답: `.
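As a rough illustration of what a per-sample `extract_regex` matching `정답: ` does, the sketch below pulls the answer letter out of a model response. The exact pattern used in the PR is not shown in this conversation, so the regex, the helper name `extract_answer`, and the choice to take the last match are all assumptions.

```python
import re
from typing import Optional

# Assumed pattern for the 10-choice case; CLIcK would use a narrower A-E range.
_ANSWER_RE = re.compile(r"정답:\s*([A-J])")


def extract_answer(response: str) -> Optional[str]:
    """Return the last answer letter following '정답:' in the response, if any."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None
```

Taking the last match is one way to tolerate responses that restate candidate answers while reasoning before committing to a final one.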
Reference numbers
Validated end to end on three production models (`ns eval` on a single 4-GPU node, judge-free MCQ). Numbers also added to `docs/evaluation/multilingual.md`.
KoBALT (symbolic_correct, %)
CLIcK (symbolic_correct, %)
Summary by CodeRabbit
New Features
Documentation