Mikslo/RAGvsFinetune

PubMedQA benchmark (Hugging Face / Ollama)

1) Setup

python3 -m venv .venv
source .venv/bin/activate
# For GPU runs, prefer a CUDA-enabled PyTorch wheel:
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

GPU check:

python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('torch_cuda=', torch.version.cuda)"

Optional (only needed if you want to use the Ollama backend):

ollama pull gemma4:e2b

2) Inference (save raw predictions to CSV)

Smoke test (5 samples):

Hugging Face backend (default):

python run_pubmedqa_eval.py --backend hf --hf_model_id google/gemma-2-2b-it --hf_device auto --config pqa_labeled --split train --limit 5

Ollama backend:

python run_pubmedqa_eval.py --backend ollama --model gemma4:e2b --config pqa_labeled --split train --limit 5

Larger run (4000 samples):

python run_pubmedqa_eval.py --backend hf --hf_model_id google/gemma-2-2b-it --hf_device auto --config pqa_artificial --split train --limit 4000

Optional custom output path:

python run_pubmedqa_eval.py --limit 5 --output_csv outputs/my_predictions.csv

3) Analysis

python analyze_results.py --input_csv outputs/my_predictions.csv --output_dir outputs --device auto

Artifacts:

  • outputs/confusion_matrix.csv
  • outputs/per_sample_scores.csv
  • outputs/analysis_summary.json
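
As a rough illustration of what a confusion matrix over the yes/no/maybe labels contains, here is a minimal sketch in plain Python. It is not the code in analyze_results.py; the function names, the row/column orientation (rows are gold labels, columns are predictions), and the sample data are all assumptions made for this example.

```python
from collections import Counter

# Assumed label set, taken from the yes/no/maybe normalization noted in this README.
LABELS = ("yes", "no", "maybe")


def confusion_matrix(gold, pred, labels=LABELS):
    """Count (gold, predicted) pairs; rows are gold labels, columns are predictions."""
    counts = Counter(zip(gold, pred))
    return {g: {p: counts[(g, p)] for p in labels} for g in labels}


def accuracy(gold, pred):
    """Fraction of samples where the prediction equals the gold label."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


if __name__ == "__main__":
    # Hypothetical toy data, not real benchmark output.
    gold = ["yes", "no", "maybe", "yes", "no"]
    pred = ["yes", "yes", "maybe", "no", "no"]
    matrix = confusion_matrix(gold, pred)
    print("gold\\pred", list(LABELS))
    for g in LABELS:
        print(g, [matrix[g][p] for p in LABELS])
    print("accuracy:", accuracy(gold, pred))
```

Reading the matrix row by row shows which gold class is most often confused with which prediction, which a single accuracy number hides.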

4) Notes

  • The prompt includes both the question and the context.
  • final_decision is normalized to one of yes, no, or maybe.
  • The generated long_answer is compared against the reference using ROUGE-L, BERTScore, and embedding distance.
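
The label normalization and the ROUGE-L comparison mentioned in the notes can be sketched roughly as follows. This is a simplified illustration, not the repository's implementation. The function names are hypothetical, and the fallback to "maybe" for unparseable output is an assumption. The ROUGE-L here uses lowercase whitespace tokens with no stemming, so its scores may differ from those of a dedicated ROUGE library.

```python
import re


def normalize_decision(text):
    """Map free-form model output to 'yes', 'no', or 'maybe'.

    Assumption: the first of these words found in the text wins, and
    output containing none of them falls back to 'maybe'.
    """
    match = re.search(r"\b(yes|no|maybe)\b", text.lower())
    return match.group(1) if match else "maybe"


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, prediction):
    """Simplified ROUGE-L F1 over lowercase whitespace tokens."""
    ref = reference.lower().split()
    hyp = prediction.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(normalize_decision("Yes, the data support this conclusion."))
    print(rouge_l_f1("the drug reduced mortality", "the drug reduced overall mortality"))
```

Taking the first matching word keeps the normalization deterministic when a model hedges ("Yes, although maybe not in all cohorts"), at the cost of ignoring later qualifiers.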
