Feat/open cultural #74

Open
gunicsroland wants to merge 11 commits into master from feat/open-cultural

@gunicsroland

Contributor

Summary

This PR introduces a new open-ended cultural evaluation task to HugMe.

Key Features

Open-Ended Question Format

The questions use 3 answer types:

  • entity - short factual answers
  • short_answer - brief sentence-length responses
  • explanation - longer, more detailed answers

Evaluation Methodology

The evaluation uses a hybrid scoring approach depending on the answer type.

Entity answers

For entity questions, the dataset provides a list of accepted aliases. Evaluation is performed using:

  • Exact string matching
  • Alias matching
  • Substring checks against accepted answers

This enables robust evaluation of short factual responses while remaining deterministic.
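
As a rough illustration of the three checks above, here is a minimal sketch. The helper names `normalize` and `entity_matches` are hypothetical and not taken from the PR, and the whole-word substring rule (accepted answer contained in the prediction) is an assumption about how the check might be applied.

```python
import string
import unicodedata
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (illustrative only)."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def entity_matches(prediction: str, accepted: Iterable[str]) -> bool:
    """Return True if the prediction matches any accepted answer or alias."""
    pred = normalize(prediction)
    if not pred:
        return False
    for alias in accepted:
        target = normalize(alias)
        if not target:
            continue
        # Exact / alias matching: the normalized strings are identical.
        if pred == target:
            return True
        # Substring check (assumed direction): the accepted answer appears
        # as whole words inside the prediction.
        if f" {target} " in f" {pred} ":
            return True
    return False


print(entity_matches("A válasz: Budapest.", ["Budapest", "Bp"]))
```

Because every step is plain string manipulation, the same input always yields the same result, which is what keeps this part of the evaluation deterministic.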

Short Answer & Explanation Answers

For short_answer and explanation questions, an LLM-as-judge evaluates the responses.
The judge assigns one of four verdicts according to a task-specific rubric:

  • Correct: the answer is right in every respect
  • Partially correct: only some parts of the answer are right
  • Incorrect: the answer is wrong
  • Uncertain: the judge cannot decide, or the output is malformed
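
The fallback to Uncertain could look roughly like the sketch below. The `Verdict` enum and `parse_verdict` function are hypothetical names, and the assumption that the judge replies with a bare label is mine; the actual prompt and output format are defined by the rubric in the PR.

```python
from enum import Enum


class Verdict(str, Enum):
    CORRECT = "correct"
    PARTIALLY_CORRECT = "partially_correct"
    INCORRECT = "incorrect"
    UNCERTAIN = "uncertain"


def parse_verdict(judge_output: str) -> Verdict:
    """Map raw judge text to a verdict, falling back to UNCERTAIN."""
    # Assumes the judge answers with just the label, e.g. "Partially correct".
    label = judge_output.strip().lower().replace(" ", "_").rstrip(".")
    try:
        return Verdict(label)
    except ValueError:
        # Empty, malformed, or unexpected output is treated as uncertain.
        return Verdict.UNCERTAIN


print(parse_verdict("Partially correct"))
```

Routing every unparseable reply to one explicit verdict keeps malformed judge output visible in the statistics instead of silently counting it as incorrect.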

Scoring and Reporting

As in other tasks, the results include per-category scores and a cumulative score. In addition, a summary of the verdict counts is saved:

{
    "cultural-open": {
        "category_scores": {
            ....
        },
        "stat_summary": {
            "total": 51,
            "correct": 8,
            "partially_correct": 2,
            "incorrect": 17,
            "uncertain": 24
        }
    }
}
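
A summary of this shape can be derived from a flat list of verdict labels. The following is only a sketch: `summarize` and `VERDICTS` are hypothetical names, and the counts are the ones from the example above.

```python
from collections import Counter
from typing import Dict, Iterable

VERDICTS = ("correct", "partially_correct", "incorrect", "uncertain")


def summarize(verdicts: Iterable[str]) -> Dict[str, int]:
    """Count each verdict and add a total, mirroring the stat_summary keys."""
    counts = Counter(verdicts)
    summary = {"total": sum(counts[v] for v in VERDICTS)}
    summary.update({v: counts[v] for v in VERDICTS})
    return summary


labels = (
    ["correct"] * 8
    + ["partially_correct"] * 2
    + ["incorrect"] * 17
    + ["uncertain"] * 24
)
print(summarize(labels))
```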

Handling Uncertain Cases

Responses that receive an Uncertain verdict are automatically exported to a separate JSON file. This enables:

  • Manual review of ambiguous cases
  • Quality control and validation of the LLM judge
  • Re-scoring after rubric improvements
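
A minimal sketch of such an export step is shown below. The function name `export_uncertain`, the `verdict` key, and the output path are assumptions for illustration; the PR's actual file name and record layout may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def export_uncertain(results: List[Dict[str, Any]], path: str) -> int:
    """Write all records with an 'uncertain' verdict to a JSON file.

    Returns the number of exported records.
    """
    uncertain = [r for r in results if r.get("verdict") == "uncertain"]
    # ensure_ascii=False keeps accented characters readable for manual review.
    Path(path).write_text(
        json.dumps(uncertain, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(uncertain)
```

Calling `export_uncertain(results, "uncertain_cases.json")` after a run would leave a reviewable file next to the regular results; the file name here is only a placeholder.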

Result file modification

In addition, the eval-result files for the cultural task now include the question_id, so that the multiple-choice and open-ended results can later be cross-evaluated.
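
To illustrate what the shared question_id makes possible, here is a hedged sketch that pairs the two result sets. Only `question_id` comes from this PR; the function name and the other record fields (`correct`, `verdict`) are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def pair_by_question_id(
    mc_results: List[Record], open_results: List[Record]
) -> Dict[str, Tuple[Record, Record]]:
    """Pair multiple-choice and open-ended records that share a question_id."""
    open_by_id = {r["question_id"]: r for r in open_results}
    return {
        r["question_id"]: (r, open_by_id[r["question_id"]])
        for r in mc_results
        if r["question_id"] in open_by_id
    }


# Hypothetical records; field names other than question_id are assumptions.
mc = [{"question_id": "q1", "correct": True}, {"question_id": "q2", "correct": False}]
op = [{"question_id": "q1", "verdict": "incorrect"}]
print(pair_by_question_id(mc, op))
```

Questions that appear in only one of the two files are skipped, so the pairing stays valid even if the task variants cover different subsets.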

Comment thread config.json

Contributor

this file is not needed in the PR

Comment on lines +37 to +40
def get_target_text(entry: Dict[str, Any]) -> Any:
    if "gold_answer" in entry:
        return entry["gold_answer"]
    return None

Contributor

this fn can be replaced with: entry.get("gold_answer")

>>> d = {"a": 0}
>>> d.get("b")
>>> d.get("b") is None
True

Comment on lines +58 to +61
if answer_type == "entity":
    return normalize_text(text, remove_punctuation=True)
if answer_type == "short_answer":
    return normalize_text(text, remove_punctuation=True)

Contributor

you can check these in one if condition
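
Something along these lines, as a rough sketch. `normalize_text` below is only a stand-in so the snippet runs on its own (the PR's helper may behave differently), and the wrapper name `normalize_for_type` and the fallback branch are assumptions, since the diff excerpt does not show what follows these lines.

```python
import string


def normalize_text(text: str, remove_punctuation: bool = False) -> str:
    """Stand-in for the PR's helper; the real implementation may differ."""
    text = " ".join(text.lower().split())
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text


def normalize_for_type(text: str, answer_type: str) -> str:
    # Both answer types share one branch via a membership test.
    if answer_type in ("entity", "short_answer"):
        return normalize_text(text, remove_punctuation=True)
    # Assumed fallback; not shown in the commented lines.
    return normalize_text(text)
```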

2 participants