feat/cultural-tasks-comparer #75

Open
gunicsroland wants to merge 2 commits into master from feat/cultural-tasks-comparer

@gunicsroland

Summary

This PR introduces a standalone evaluation comparison script that merges and analyzes results from the ABCD-style and open-answer evaluation pipelines. It aligns per-question outputs by question_id and produces structured CSV reports for model-level, category-level, and outcome-level analysis.

Functionality

  • Loads and pairs ABCD and Open evaluation result files by filename convention
  • Aligns results at question_id level
  • Computes per-item comparison metrics:
    • ABCD score + correctness (threshold-based)
    • Open score + strict correctness
    • Open verdict
    • Outcome classification:
      • both_correct
      • abcd_only
      • open_only
      • both_wrong
  • Generates aggregated reports:
    • Model summary (accuracy + gap)
    • Category breakdown
    • Outcome distribution
  • Exports CSV outputs per model and global summaries
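
A minimal sketch of the alignment and outcome classification described above. The input shape (`{question_id: score}` dicts), the function names, and the choice to keep only question_ids present in both result sets are assumptions for illustration, not necessarily what the script in this PR does; the 1.0 threshold mirrors the `score >= 1.0` checks in the diff.

```python
def classify_outcome(abcd_ok: bool, open_ok: bool) -> str:
    """Map the two correctness flags to one of the four outcome labels."""
    if abcd_ok and open_ok:
        return "both_correct"
    if abcd_ok:
        return "abcd_only"
    if open_ok:
        return "open_only"
    return "both_wrong"


def compare_items(abcd_scores: dict, open_scores: dict, threshold: float = 1.0) -> list:
    """Align two {question_id: score} mappings and label each shared item.

    Hypothetical helper: the dict inputs and intersection-only alignment
    are assumptions made for this sketch.
    """
    rows = []
    # Only question_ids present in both result sets can be compared.
    for qid in sorted(abcd_scores.keys() & open_scores.keys()):
        abcd_ok = abcd_scores[qid] >= threshold
        open_ok = open_scores[qid] >= threshold
        rows.append(
            {
                "question_id": qid,
                "abcd_correct": abcd_ok,
                "open_correct": open_ok,
                "outcome": classify_outcome(abcd_ok, open_ok),
            }
        )
    return rows
```

Each returned row could then be written out as one line of an item-level CSV and grouped by `outcome` for the summary reports.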

Output files

For each model

  • merged_item_level_{model}.csv
  • item_level_differences_{model}.csv

Global

  • model_summary.csv
  • category_summary.csv
  • outcome_summary.csv

Comment on lines +9 to +11
def load_json(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

We already have a function for this in the helper file, called read.json; please use that instead.

Comment on lines +14 to +19
def strict_open_correct(score):
    return score >= 1.0


def abcd_correct(score):
    return score >= 1.0

These two are the same function; please merge them into one.
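
One possible shape for a single shared check, as a sketch only. The name `is_correct` and the `threshold` parameter are hypothetical; the default of 1.0 matches both functions in the diff.

```python
def is_correct(score: float, threshold: float = 1.0) -> bool:
    """Return True when a score meets the correctness threshold.

    Hypothetical replacement for strict_open_correct and abcd_correct,
    which currently apply the same `score >= 1.0` test.
    """
    return score >= threshold
```

Keeping the threshold as a parameter would let the ABCD and open-answer paths diverge later without reintroducing two functions.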

This is a script; please create a scripts folder in the root directory and place it there.

Comment on lines +193 to +194
print(f"Processed {len(merged_df)} questions")
print(f"Output written to: {output_dir}")

Please use logging instead of prints.
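
For example, the two lines could be rewritten roughly like this with the standard library `logging` module. The `report_run` wrapper and the values passed in the `__main__` block are placeholders for illustration; in the script, `len(merged_df)` and `output_dir` would be passed directly.

```python
import logging

# Module-level logger, named after the module as is conventional.
logger = logging.getLogger(__name__)


def report_run(n_questions: int, output_dir: str) -> None:
    """Log the end-of-run summary (hypothetical wrapper for this sketch)."""
    # Lazy %-style arguments: the message is only formatted if it is emitted.
    logger.info("Processed %d questions", n_questions)
    logger.info("Output written to: %s", output_dir)


if __name__ == "__main__":
    # Configure logging once, at the script entry point.
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s %(name)s: %(message)s",
    )
    report_run(0, "output")  # placeholder values
```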
