feat/cultural-tasks-comparer by gunicsroland · Pull Request #75 · nytud/hugme

gunicsroland · 2026-06-23T18:04:43Z

Summary

This PR introduces a standalone evaluation comparison script that merges and analyzes results from the ABCD-style (multiple-choice) and open-answer evaluation pipelines. It aligns per-question outputs on question_id and produces structured CSV reports for model-level, category-level, and outcome-level analysis.

Functionality

Loads and pairs ABCD and Open evaluation result files by filename convention
Aligns results at question_id level
Computes per-item comparison metrics:
- ABCD score + correctness (threshold-based)
- Open score + strict correctness
- Open verdict
- Outcome classification:
  - both_correct
  - abcd_only
  - open_only
  - both_wrong
Generates aggregated reports:
- Model summary (accuracy + gap)
- Category breakdown
- Outcome distribution
Exports CSV outputs per model and global summaries
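
The alignment, outcome classification, and model summary described above can be sketched roughly as follows. This is a minimal illustration and not the script from this PR: the record layout (`question_id` and `score` keys), the function names, the `THRESHOLD` value, and the definition of the gap as ABCD accuracy minus open accuracy are all assumptions.

```python
from collections import Counter

THRESHOLD = 1.0  # assumption: a score at or above this value counts as correct


def classify_outcome(abcd_ok, open_ok):
    """Map the two correctness flags to one of the four outcome labels."""
    if abcd_ok and open_ok:
        return "both_correct"
    if abcd_ok:
        return "abcd_only"
    if open_ok:
        return "open_only"
    return "both_wrong"


def compare_items(abcd_results, open_results):
    """Align two result lists on question_id and classify each shared item."""
    open_by_id = {row["question_id"]: row for row in open_results}
    merged = []
    for abcd_row in abcd_results:
        open_row = open_by_id.get(abcd_row["question_id"])
        if open_row is None:
            continue  # no open-answer result for this question; skipped in this sketch
        abcd_ok = abcd_row["score"] >= THRESHOLD
        open_ok = open_row["score"] >= THRESHOLD
        merged.append({
            "question_id": abcd_row["question_id"],
            "abcd_correct": abcd_ok,
            "open_correct": open_ok,
            "outcome": classify_outcome(abcd_ok, open_ok),
        })
    return merged


def summarize(merged):
    """Compute per-model accuracies, their gap, and the outcome counts."""
    n = len(merged)
    if n == 0:
        return {"abcd_accuracy": 0.0, "open_accuracy": 0.0, "gap": 0.0, "outcomes": {}}
    abcd_acc = sum(r["abcd_correct"] for r in merged) / n
    open_acc = sum(r["open_correct"] for r in merged) / n
    return {
        "abcd_accuracy": abcd_acc,
        "open_accuracy": open_acc,
        "gap": abcd_acc - open_acc,  # assumed definition of the gap
        "outcomes": dict(Counter(r["outcome"] for r in merged)),
    }
```

Questions that appear in only one of the two result sets are dropped here; the actual script may handle them differently.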

Output files

For each model

merged_item_level_{model}.csv
item_level_differences_{model}.csv

Global

model_summary.csv
category_summary.csv
outcome_summary.csv
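
A rough sketch of how the per-model files listed above could be written, using only the standard library. The file name patterns come from this description; the column names, the `export_model_reports` helper, and the rule that the differences file contains only items where the two pipelines disagree are assumptions made for illustration.

```python
import csv
from pathlib import Path

FIELDNAMES = ["question_id", "outcome"]  # hypothetical column set


def write_csv(path, rows):
    """Write rows (dicts) to path with a header, ignoring any extra keys."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


def export_model_reports(output_dir, model, merged_rows):
    """Write the two per-model CSV files and return their paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    merged_path = out / f"merged_item_level_{model}.csv"
    diff_path = out / f"item_level_differences_{model}.csv"
    write_csv(merged_path, merged_rows)
    # Assumption: "differences" means the two pipelines disagree on the item.
    differences = [r for r in merged_rows if r["outcome"] in ("abcd_only", "open_only")]
    write_csv(diff_path, differences)
    return merged_path, diff_path
```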


matyasosvath · 2026-06-24T08:45:34Z

+def load_json(path):
+    with open(path, "r", encoding="utf-8") as f:
+        return json.load(f)


We already have a function for this in the helper file, called read.json; please use that instead.

matyasosvath · 2026-06-24T08:48:54Z

+def strict_open_correct(score):
+    return score >= 1.0
+
+
+def abcd_correct(score):
+    return score >= 1.0


These two functions are identical; please keep only one.
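
One possible way to address this, sketched under assumptions: a single helper with a configurable threshold replaces both functions. The name `is_correct` and the `threshold` parameter are hypothetical and do not come from the PR.

```python
def is_correct(score, threshold=1.0):
    """Return True when the score reaches the threshold (shared by ABCD and open results)."""
    return score >= threshold
```

Both call sites could then use `is_correct(score)`, and a different cut-off for one pipeline would only need a different `threshold` argument.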

matyasosvath · 2026-06-24T08:49:18Z

This is a script. Please create a scripts folder in the root directory and place it there.

matyasosvath · 2026-06-24T08:50:00Z

+    print(f"Processed {len(merged_df)} questions")
+    print(f"Output written to: {output_dir}")


Please use logging instead of prints.
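
A minimal sketch of what the requested change might look like with the standard `logging` module. The `report_run` wrapper, the log format, and the sample values in the entry point are illustrative assumptions, not code from the PR.

```python
import logging

logger = logging.getLogger(__name__)


def report_run(n_questions, output_dir):
    """Log the end-of-run summary instead of printing it."""
    logger.info("Processed %d questions", n_questions)
    logger.info("Output written to: %s", output_dir)


if __name__ == "__main__":
    # Configure logging once, at the script entry point.
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
    report_run(120, "results/comparison")  # placeholder values
```

Passing the values as arguments rather than building an f-string lets the logging module skip the formatting when the INFO level is disabled.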

gunicsroland added 2 commits June 23, 2026 19:45

feat: added script to compare open and mutliple choice results of cul…

3774bbb

…tural tasks

chore: linter fixes

d1265c2

matyasosvath requested changes Jun 24, 2026

gunicsroland wants to merge 2 commits into master from feat/cultural-tasks-comparer
