ProgrammedInsanity/llm_eval_on_test_data

llm-eval-on-test-data

A tool for evaluating and comparing large language model performance on translation tasks using a Chinese-English test dataset.

Overview

This project evaluates several LLMs on translation tasks, scores their output with the COMET metric, and generates visualizations of the results.

Requirements

  • Rye package manager

Installation

Install dependencies:

rye sync

Configuration

  1. Create a .env file with required API keys and settings (see .env.example)
  2. Prepare your test dataset or use the provided passage_pairs_test_dataset.json
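The two configuration steps can be sketched in plain Python. This is a minimal illustration using only the standard library, not the project's actual loading code: the function names (load_env_file, apply_env, load_dataset) are hypothetical, and the variable names your .env needs are whatever .env.example lists. The dataset's JSON schema is not described in this README, so the sketch returns the parsed data unchanged.

```python
import json
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values):
    """Copy parsed values into os.environ without overriding existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)


def load_dataset(path="passage_pairs_test_dataset.json"):
    """Load the test dataset as parsed JSON (schema is an open assumption)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling apply_env(load_env_file()) before any API client is created makes the keys available through os.environ. Because setdefault is used, values already exported in the shell take precedence over the file.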

Usage

Run the evaluation script:

rye run python src/llm_eval_on_test_data/__init__.py

The script will:

  • Load test data from passage_pairs_test_dataset.json
  • Fetch translations using the configured LLMs via the translation_fetcher.py module
  • Store results in translations.db
  • Generate performance comparisons and visualizations using the plot.py module
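The fetch-and-store steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the table layout, the function names (open_store, get_or_fetch, run_pipeline), and the fetch callback signature are all hypothetical, and whether the real script reuses stored rows instead of re-querying the LLM is also an assumption.

```python
import sqlite3


def open_store(db_path="translations.db"):
    """Open the SQLite file and create a (hypothetical) translations table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS translations (
            model TEXT NOT NULL,
            source TEXT NOT NULL,
            translation TEXT NOT NULL,
            PRIMARY KEY (model, source)
        )
        """
    )
    return conn


def get_or_fetch(conn, model, source, fetch):
    """Return a stored translation, calling fetch(model, source) only on a miss."""
    row = conn.execute(
        "SELECT translation FROM translations WHERE model = ? AND source = ?",
        (model, source),
    ).fetchone()
    if row is not None:
        return row[0]
    translation = fetch(model, source)
    conn.execute(
        "INSERT INTO translations (model, source, translation) VALUES (?, ?, ?)",
        (model, source, translation),
    )
    conn.commit()
    return translation


def run_pipeline(conn, models, sources, fetch):
    """Collect one translation per (model, source) pair, keyed by model name."""
    return {
        model: [get_or_fetch(conn, model, source, fetch) for source in sources]
        for model in models
    }
```

Storing results keyed by model and source text means a rerun after an interruption only pays for the API calls that have not completed yet. The fetch argument stands in for whatever translation_fetcher.py exposes.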

Results

Performance comparisons are visualized and saved as model_performance_comparison.png.
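A comparison chart needs one summary number per model. The sketch below shows one plausible aggregation step, averaging per-passage scores by model, using only the standard library. The record shape (model name, score) and the function names are assumptions for illustration. How plot.py actually aggregates COMET scores is not described here.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_model(records):
    """Average scores per model; records is an iterable of (model, score) pairs."""
    grouped = defaultdict(list)
    for model, score in records:
        grouped[model].append(score)
    return {model: mean(scores) for model, scores in grouped.items()}


def rank_models(records):
    """Return (model, mean score) pairs sorted from highest to lowest mean."""
    averages = mean_score_by_model(records)
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

The ranked pairs can be passed directly to a bar-chart call in the plotting library of your choice.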

Project Structure

  • src/llm_eval_on_test_data: Core source code
    • __init__.py: Entry point
    • translation_fetcher.py: Handles translation requests to LLMs
    • plot.py: Generates visualizations of performance metrics
