PubMedQA benchmark (Hugging Face / Ollama)

1) Setup

python3 -m venv .venv
source .venv/bin/activate
# For GPU runs, install a CUDA-enabled PyTorch wheel first (cu124 targets CUDA 12.4; pick the index URL that matches your driver):
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

GPU check (prints whether CUDA is available and which CUDA version PyTorch was built against):

python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('torch_cuda=', torch.version.cuda)"

Optional (only if you want to use the Ollama backend):

ollama pull gemma4:e2b

2) Inference (save raw predictions to CSV)

Smoke test (5 samples):

Hugging Face backend (default):

python run_pubmedqa_eval.py --backend hf --hf_model_id google/gemma-2-2b-it --hf_device auto --config pqa_labeled --split train --limit 5

Ollama backend:

python run_pubmedqa_eval.py --backend ollama --model gemma4:e2b --config pqa_labeled --split train --limit 5

Larger run (4000 samples from the pqa_artificial config):

python run_pubmedqa_eval.py --backend hf --hf_model_id google/gemma-2-2b-it --hf_device auto --config pqa_artificial --split train --limit 4000

Optional custom output path:

python run_pubmedqa_eval.py --limit 5 --output_csv outputs/my_predictions.csv

3) Analysis

python analyze_results.py --input_csv outputs/my_predictions.csv --output_dir outputs --device auto

Artifacts written to the output directory:

outputs/confusion_matrix.csv
outputs/per_sample_scores.csv
outputs/analysis_summary.json
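
As a rough illustration of what a confusion matrix over the three decision labels contains, the sketch below builds one from gold and predicted labels in plain Python. The function names `confusion_matrix` and `accuracy`, the sample data, and the nested-dict layout are assumptions made for this example; the actual layout of outputs/confusion_matrix.csv is determined by analyze_results.py and may differ.

```python
from collections import Counter

LABELS = ["yes", "no", "maybe"]

def confusion_matrix(gold, pred, labels=LABELS):
    """Return a nested dict: matrix[gold_label][pred_label] -> count."""
    counts = Counter(zip(gold, pred))
    return {g: {p: counts[(g, p)] for p in labels} for g in labels}

def accuracy(gold, pred):
    """Fraction of samples where the prediction equals the gold label."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Hypothetical labels, for illustration only.
gold = ["yes", "no", "maybe", "yes"]
pred = ["yes", "yes", "maybe", "no"]

matrix = confusion_matrix(gold, pred)
for g in LABELS:
    # One row per gold label, one column per predicted label.
    print(g, [matrix[g][p] for p in LABELS])
print("accuracy =", accuracy(gold, pred))
```

Rows index the gold label and columns the predicted label, so off-diagonal cells show which decisions the model confuses.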

4) Notes

The prompt includes both the question and its context.
final_decision is normalized to one of yes/no/maybe.
long_answer is compared against the reference using ROUGE-L, BERTScore, and embedding distance.
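
The label normalization and the ROUGE-L comparison mentioned in the notes above could look roughly like the sketch below. It is not the code used by run_pubmedqa_eval.py or analyze_results.py: the helper names `normalize_decision`, `lcs_length`, and `rouge_l_f1` are hypothetical, tokenization is a plain whitespace split, and the score is the simple F1 variant of ROUGE-L, so numbers from a dedicated ROUGE library may differ slightly.

```python
import re

def normalize_decision(text):
    """Map free-form model output to yes/no/maybe, or None if no label is found."""
    match = re.search(r"\b(yes|no|maybe)\b", text.lower())
    return match.group(1) if match else None

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (assumed variant, lowercased)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(normalize_decision("Answer: MAYBE."))
print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Returning None for unparseable output keeps such samples visible in the analysis instead of silently counting them as one of the three labels.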
