Feat/open cultural by gunicsroland · Pull Request #74 · nytud/hugme

gunicsroland · 2026-06-23T16:35:26Z

Summary

This PR introduces a new open-ended cultural evaluation task to HugMe.

Key Features

Open-Ended Question Format

The questions use 3 answer types:

entity - short factual answers
short_answer - brief sentence-length responses
explanation - longer, more detailed answers

Evaluation Methodology

The evaluation uses a hybrid scoring approach depending on the answer type.

Entity answers

For entity questions, the dataset provides a list of accepted aliases. Evaluation is performed using:

Exact string matching
Alias matching
Substring checks against accepted answers

This enables robust evaluation of short factual responses while remaining deterministic.
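
A minimal sketch of the three checks above, not the code in this PR. The helper names (`normalize`, `match_entity`), the normalization steps, and the direction of the substring check (an accepted answer found inside the response, on word boundaries) are assumptions made for illustration.

```python
import re
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def match_entity(response: str, gold: str, aliases: Iterable[str] = ()) -> bool:
    """Return True if the response matches the gold answer or an accepted alias."""
    answer = normalize(response)
    accepted = [normalize(a) for a in (gold, *aliases)]
    accepted = [a for a in accepted if a]  # ignore empty entries
    # Exact match against the gold answer, or against any alias.
    if answer in accepted:
        return True
    # Substring check: an accepted answer occurs in the response as whole words.
    padded = f" {answer} "
    return any(f" {a} " in padded for a in accepted)
```

Padding with spaces keeps the substring check on word boundaries, so an accepted answer such as "Pest" does not match a response of "Budapest".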

Short Answer & Explanation Answers

For short_answer and explanation answers, an LLM-as-judge evaluates the responses.
The judge assigns one of four verdicts according to a task-specific rubric:

Correct: the answer is fully correct
Partially correct: parts of the answer are correct
Incorrect: the answer is wrong
Uncertain: the judge can't decide, or the output is malformed
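
A minimal sketch of mapping raw judge output to one of the four verdicts. It assumes the judge replies with a JSON object holding a `verdict` field. That reply format, the `Verdict` enum, and `parse_verdict` are hypothetical and may differ from what this PR does.

```python
import json
from enum import Enum


class Verdict(str, Enum):
    CORRECT = "correct"
    PARTIALLY_CORRECT = "partially_correct"
    INCORRECT = "incorrect"
    UNCERTAIN = "uncertain"


def parse_verdict(raw_judge_output: str) -> Verdict:
    """Map raw judge text to a Verdict; anything unparseable becomes UNCERTAIN."""
    try:
        payload = json.loads(raw_judge_output)
    except json.JSONDecodeError:
        return Verdict.UNCERTAIN  # malformed output
    if not isinstance(payload, dict):
        return Verdict.UNCERTAIN  # valid JSON, but not the assumed object shape
    # Accept "Partially correct" as well as "partially_correct".
    label = str(payload.get("verdict", "")).strip().lower().replace(" ", "_")
    try:
        return Verdict(label)
    except ValueError:
        return Verdict.UNCERTAIN  # unknown or missing label
```

Falling back to `UNCERTAIN` on every parse failure keeps malformed outputs visible for manual review instead of silently counting them as incorrect.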

Scoring and Reporting

As in other tasks, it saves per-category scores and a cumulative score, and it additionally saves a summary of the verdict counts:

{
    "cultural-open": {
        "category_scores": {
            ....
        },
        "stat_summary": {
            "total": 51,
            "correct": 8,
            "partially_correct": 2,
            "incorrect": 17,
            "uncertain": 24
        }
    }
}
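
A minimal sketch of how a `stat_summary` like the one above could be built from a list of per-question verdict labels. The function name and the input shape are assumptions; only the output keys come from the example.

```python
from collections import Counter
from typing import Dict, Iterable

VERDICTS = ("correct", "partially_correct", "incorrect", "uncertain")


def summarize_verdicts(verdicts: Iterable[str]) -> Dict[str, int]:
    """Count each verdict label; labels outside VERDICTS are not counted."""
    counts = Counter(verdicts)
    summary = {"total": sum(counts[v] for v in VERDICTS)}
    summary.update({v: counts[v] for v in VERDICTS})
    return summary
```

Because `Counter` returns 0 for missing keys, every verdict key appears in the summary even when no question received that verdict.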

Handling Uncertain Cases

Responses that receive an Uncertain verdict are automatically exported to a separate JSON file. This enables:

Manual review of ambiguous cases
Quality control and validation of the LLM judge
Re-scoring after rubric improvements
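
A minimal sketch of the export step, assuming each result is a dict with a `verdict` key. The function name, the result fields, and the output path are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def export_uncertain(results: List[Dict[str, Any]], path: Path) -> int:
    """Write all results with an 'uncertain' verdict to a JSON file; return how many."""
    uncertain = [r for r in results if r.get("verdict") == "uncertain"]
    # ensure_ascii=False keeps accented characters readable in the file.
    path.write_text(
        json.dumps(uncertain, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return len(uncertain)
```

Keeping the full result record in the export means a reviewer, or a later re-scoring run, can work from this file alone.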

Result file modification

Furthermore, for the cultural task the eval-result files now contain the question_id, so the multiple-choice and open-ended results can later be cross-evaluated.
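
A minimal sketch of such a cross-evaluation, joining the two result sets on `question_id`. Only `question_id` comes from this PR. The `correct` and `verdict` fields and the function name are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def cross_evaluate(
    mc_results: Iterable[Dict[str, Any]],
    open_results: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair multiple-choice and open-ended results that share a question_id."""
    mc_by_id = {r["question_id"]: r for r in mc_results}
    paired = []
    for r in open_results:
        mc = mc_by_id.get(r["question_id"])
        if mc is None:
            continue  # question has no multiple-choice counterpart
        paired.append(
            {
                "question_id": r["question_id"],
                "mc_correct": mc["correct"],
                "open_verdict": r["verdict"],
            }
        )
    return paired
```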

matyasosvath · 2026-06-24T08:14:57Z

this file is not needed in the PR

matyasosvath · 2026-06-24T08:17:31Z

+def get_target_text(entry: Dict[str, Any]) -> Any:
+    if "gold_answer" in entry:
+        return entry["gold_answer"]
+    return None


this fn can be replaced with: entry.get("gold_answer")

>>> d = {"a": 0}
>>> d.get("b")
>>> d.get("b") is None
True

matyasosvath · 2026-06-24T08:18:22Z

+    if answer_type == "entity":
+        return normalize_text(text, remove_punctuation=True)
+    if answer_type == "short_answer":
+        return normalize_text(text, remove_punctuation=True)


you can check these in one if condition
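
A minimal sketch of the single-condition version. `normalize_text` here is a stand-in written for illustration, and both the wrapper name `normalize_by_type` and the fall-through branch for other answer types are assumptions, since the diff excerpt shows neither.

```python
import string


def normalize_text(text: str, remove_punctuation: bool = False) -> str:
    """Stand-in for the project's normalize_text (assumed behaviour)."""
    text = text.strip().lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def normalize_by_type(text: str, answer_type: str) -> str:
    # One membership test replaces the two identical branches.
    if answer_type in ("entity", "short_answer"):
        return normalize_text(text, remove_punctuation=True)
    # Assumed handling for the remaining answer types.
    return normalize_text(text)
```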

gunicsroland added 11 commits June 19, 2026 10:40

feat: added templates for open ended cultural questions

a039411

feat: basic pipeline evaluation for open cultural questions

471b3b8

fix: added question id to results, for cross evaluation later

24512ca

feat:updated answer text normalization

56bea05

feat: updated evaluation response

97f9931

feat: including scoring rubring in judge prompt

df12b39

feat: reshaped output to contain eval results and answer_category summary

28f79d8

fix: replaced _ with - in filename

adcb931

fix: fixed open-cultural output formatting

6bf43e6

fix: changed open_cultural output formatting

4e1722f

chore: linting fixes

e8a1d8f

matyasosvath reviewed Jun 24, 2026

Comment thread config.json

this file is not needed in the PR

matyasosvath requested changes Jun 24, 2026

gunicsroland wants to merge 11 commits into master from feat/open-cultural

2 participants
