A tool for evaluating and comparing large language model performance on translation tasks using a Chinese-English test dataset.
This project evaluates various LLMs on translation tasks, comparing their performance using COMET and generating visualizations of the results.
- Rye package manager
Install dependencies:
    rye sync
- Create a .env file with required API keys and settings (see .env.example)
- Prepare your test dataset or use the provided passage_pairs_test_dataset.json
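As a rough illustration of the .env step above, the sketch below parses KEY=VALUE lines and reports which required settings are still empty. It uses only the standard library; the key name in the usage comment is a hypothetical placeholder, since the actual keys are listed in .env.example and not here, and the project itself may load its settings differently.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="abc" and KEY=abc are equivalent.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# Usage (EXAMPLE_API_KEY is a hypothetical name, not one the project defines):
#   with open(".env", encoding="utf-8") as f:
#       env = parse_env(f.read())
#   print(missing_keys(env, ["EXAMPLE_API_KEY"]))
```

Checking for missing keys before the run starts avoids failing partway through a batch of translation requests.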
Run the evaluation script:
    rye run python src/llm_eval_on_test_data/__init__.py
The script will:
- Load test data from passage_pairs_test_dataset.json
- Fetch translations using the configured LLMs via the translation_fetcher.py module
- Store results in translations.db
- Generate performance comparisons and visualizations using the plot.py module
Performance comparisons are visualized and saved as model_performance_comparison.png.
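The load, fetch, and store steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the JSON layout (a list of objects with `source` and `reference` fields), the `translations` table schema, and the function names are all assumptions, and each model is represented by a plain callable instead of a real API client.

```python
import json
import sqlite3
from typing import Callable


def load_pairs(path: str) -> list[dict]:
    """Load passage pairs; assumes a JSON list of {"source", "reference"} objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that holds one translation per model and source."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS translations (
            model       TEXT NOT NULL,
            source      TEXT NOT NULL,
            reference   TEXT NOT NULL,
            translation TEXT NOT NULL,
            PRIMARY KEY (model, source)
        )
        """
    )


def fetch_and_store(
    conn: sqlite3.Connection,
    pairs: list[dict],
    models: dict[str, Callable[[str], str]],
) -> None:
    """Translate every source passage with every model, skipping rows already stored."""
    for name, translate in models.items():
        for pair in pairs:
            cached = conn.execute(
                "SELECT 1 FROM translations WHERE model = ? AND source = ?",
                (name, pair["source"]),
            ).fetchone()
            if cached:
                continue  # already fetched on an earlier run
            conn.execute(
                "INSERT INTO translations VALUES (?, ?, ?, ?)",
                (name, pair["source"], pair["reference"], translate(pair["source"])),
            )
    conn.commit()


# Usage:
#   conn = sqlite3.connect("translations.db")
#   init_db(conn)
#   fetch_and_store(conn, load_pairs("passage_pairs_test_dataset.json"), models)
```

Keying the table on model and source lets a rerun reuse translations that were already fetched, so only new models or new passages trigger further requests.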
- llm_eval_on_test_data: Core source code
  - __init__.py: Entry point
  - translation_fetcher.py: Handles translation requests to LLMs
  - plot.py: Generates visualizations of performance metrics