A reproducible evaluation framework for assessing large language models on Portuguese multiple-choice questions from ENADE (Exame Nacional de Desempenho dos Estudantes).
ENADE-QA is a benchmark of 716 unique questions drawn from health-sciences ENADE exams (2013–2023), covering ten undergraduate programs: Biomedicine, Nursing, Pharmacy, Physiotherapy, Speech-Language Pathology, General Training, Medicine, Nutrition, Dentistry, and Psychology.
The framework implements six evaluation strategies (S1–S6) combining generation-based and log-probability scoring under zero-shot, one-shot, and few-shot prompting, with and without answer-option shuffling.
enade-qa-benchmark/
├── enade_benchmark/ # Python evaluation package
│ ├── dataset.py # dataset loading and filtering
│ ├── model_loader.py # HuggingFace model loader and API clients
│ ├── prompts.py # PromptBuilder — reads YAML prompt files
│ ├── evaluator.py # evaluation loop, scoring methods, answer extractor
│ ├── metrics.py # accuracy, BPC, strategy runner
│ ├── batch.py # Maritaca Batch API support
│ ├── batch_google.py # Google Gemini Batch API support
│ └── io.py # result saving and loading
│
├── prompts/ # editable prompt templates (no Python changes needed)
│ ├── default.yaml # instruction, system message, labels
│ └── few_shot_examples.yaml # examples for one-shot and few-shot strategies
│
├── configs/ # one YAML file per experiment
│ ├── example_sabia7b.yaml # example: local HuggingFace model
│ └── example_api_anthropic.yaml
│
├── enade_mcqa_hf/data/ # dataset files (CSV and JSONL)
│ ├── enade_mcqa_unique_clean.csv
│ └── enade_mcqa_unique_clean.jsonl
│
├── results/ # experiment outputs (JSON + CSV per run)
├── figures/ # generated plots
├── latex/ # generated LaTeX tables
│
├── run.py # main CLI
├── 01_tables.ipynb # analysis: accuracy tables, BPC, temporal validity
├── 02_figures.ipynb # analysis: position plots, per-experiment panels
├── MCQA_enade_analysis.ipynb # full analysis notebook (reference)
├── refact_dataset.ipynb # dataset construction and deduplication
├── requirements.txt
└── setup.py
python -m venv venv_enade
source venv_enade/bin/activate
pip install -r requirements.txt
pip install -e .
# Optional: API provider SDKs
pip install anthropic openai
pip install google-generativeai google-genai
# Single experiment
python run.py configs/example_sabia7b.yaml
# All six evaluation strategies (generates Table 1)
python run.py configs/example_sabia7b.yaml --all-strategies
# Single strategy
python run.py configs/example_sabia7b.yaml --strategy S5
# Batch API (Maritaca / Google, async up to 24h)
python run.py configs/maritaca/all_strategies.yaml --batch-submit
python run.py configs/maritaca/all_strategies.yaml --batch-fetch
All parameters are defined in a YAML file; no Python edits are required.
name: llama3_8b_zero_shot
description: "LLaMA-3.1-8B-Instruct, zero-shot generation"
dataset:
  path: enade_mcqa_hf/data/enade_mcqa_unique_clean.csv
  filter_images: true
  filter_cannot_fix: true
  anos: null # null = all years; e.g. [2019, 2022, 2023]
  cursos: null
  componentes: null
model:
  name: "meta-llama/Llama-3.1-8B-Instruct"
  type: causal_lm # causal_lm | seq2seq
  hf_token: null # null = uses HF_TOKEN env variable
  load_in_4bit: false
evaluation:
  method: generation # generation | log_prob | first_token
  shot_type: zero # zero | one | few
  shuffle_alternatives: false
  seed: 42
  max_new_tokens: 10
  temperature: null # null = greedy decoding
prompts:
  dir: prompts
output:
  dir: results/llama3_8b
  save_intermediate: true

name: gpt4o_mini_zero_shot
dataset:
  path: enade_mcqa_hf/data/enade_mcqa_unique_clean.csv
  filter_images: true
  filter_cannot_fix: true
  anos: null
  cursos: null
  componentes: null
model:
  type: api
api:
  provider: openai # anthropic | openai | maritaca | google | nvidia
  model: gpt-4o-mini
  api_key: null # null = reads from environment variable
  max_tokens: 100
evaluation:
  method: api
  shot_type: zero
  shuffle_alternatives: false
  seed: 42
  n_samples: 1
  temperature: 0.0
prompts:
  dir: prompts
output:
  dir: results/gpt4o_mini
  save_intermediate: true

| provider | Example model | Environment variable |
|---|---|---|
| anthropic | claude-sonnet-4-6 | ANTHROPIC_API_KEY |
| openai | gpt-4o-mini | OPENAI_API_KEY |
| maritaca | sabia-3 | MARITACA |
| google | gemini-2.5-flash | GEMINI |
| nvidia | deepseek-ai/deepseek-v3 | NVIDIA_API_KEY |
| # | Method | Shot | Shuffle | Description |
|---|---|---|---|---|
| S1 | generation | zero | ✗ | Zero-shot generation |
| S2 | generation | one | ✗ | One-shot generation |
| S3 | generation | few | ✗ | Few-shot generation (2 examples) |
| S4 | generation | zero | ✓ | Zero-shot shuffled — measures BPC |
| S5 | log_prob | zero | ✗ | Log-probability scoring, zero-shot |
| S6 | log_prob | one | ✗ | Log-probability scoring, one-shot |
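The shuffling used by S4 can be pictured with a small sketch. It assumes a seeded permutation of the answer options with the correct-answer index remapped accordingly; the function name `shuffle_options` is hypothetical, and whether the framework applies the seed per run or per question is not stated here.

```python
import random

def shuffle_options(options, correct_index, seed):
    """Shuffle answer options deterministically.

    Returns (shuffled_options, new_correct_index). The helper name and the
    seeding scheme are illustrative assumptions, not the framework's API.
    """
    rng = random.Random(seed)          # local RNG, so global state is untouched
    order = list(range(len(options)))  # order[j] = original index now at position j
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)

options = ["opção 1", "opção 2", "opção 3", "opção 4", "opção 5"]
shuffled, new_index = shuffle_options(options, correct_index=2, seed=42)
print(shuffled, "ABCDE"[new_index])
```

Using a dedicated `random.Random(seed)` instance keeps the permutation reproducible across runs without interfering with other random state.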
BPC (Bias by Position of Choice, Wang et al. 2023) measures how far the distribution of predicted answer positions deviates from uniform. It is computed from the shuffled S4 runs and stratified by the number of answer options (four or five).
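The exact BPC formula is not given in this README, so the following is only a sketch under an assumed definition: the total variation distance between the empirical distribution of predicted positions and the uniform distribution, computed separately for each option count. The name `position_bias` and the input format are hypothetical.

```python
from collections import Counter, defaultdict

def position_bias(predictions):
    """predictions: iterable of (predicted_position, n_options) pairs,
    with predicted_position 0-based.

    Returns {n_options: total variation distance from uniform}.
    ASSUMPTION: total variation distance stands in for the BPC deviation
    measure, which this README does not define precisely.
    """
    by_n = defaultdict(list)
    for position, n_options in predictions:
        by_n[n_options].append(position)

    result = {}
    for n_options, positions in by_n.items():
        counts = Counter(positions)
        total = len(positions)
        uniform = 1.0 / n_options
        result[n_options] = 0.5 * sum(
            abs(counts.get(i, 0) / total - uniform) for i in range(n_options)
        )
    return result

# Toy input: three five-option predictions and two four-option predictions.
print(position_bias([(0, 5), (0, 5), (1, 5), (2, 4), (2, 4)]))
```

Under this definition a value of 0 means predictions are spread evenly over positions, and larger values mean the model favors some positions regardless of content.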
S5/S6 require local models (APIs do not expose log-probabilities).
All prompt templates live in prompts/ and can be edited without touching Python code.
Prepended to every question. {valid_letters} is replaced by the valid answer letters (e.g., A, B, C, D, E):
Leia a questão do ENADE abaixo e escolha a alternativa correta ({valid_letters}).
Responda no formato exato: "Resposta: X" — sem mais texto.
Sent as the system message to the Anthropic, OpenAI, Maritaca, and Google APIs. {valid_letters} is replaced at runtime:
Você é um especialista em questões do ENADE (Exame Nacional de Desempenho dos Estudantes),
abrangendo ciências da saúde como medicina, enfermagem, farmácia, fisioterapia, nutrição,
odontologia, fonoaudiologia, psicologia e biomedicina, além de formação geral.
Analise cuidadosamente todas as alternativas antes de escolher.
Responda APENAS com a letra da alternativa correta ({valid_letters}).
Não inclua explicações, ponto final ou qualquer outro texto.
Two examples are used for one-shot (S2/S6) and few-shot (S3) strategies. Both are drawn from publicly available ENADE general-training items:
Example 1 — Literature / Social sciences (answer: D)
Retrato de uma princesa desconhecida (Andresen, 2004)
"No poema, a autora sugere que"
→ D. o trabalho compulsório de escravos proporcionou privilégios aos príncipes.
Example 2 — Biology / Evolutionary theory (answer: C)
Asserções sobre evolução adaptativa e seleção natural
→ C. A primeira asserção é uma proposição verdadeira, e a segunda, uma proposição falsa.
The full question texts, alternatives, and correct answers are stored in prompts/few_shot_examples.yaml.
[system]: Você é um especialista em questões do ENADE ...
[user]:
Leia a questão do ENADE abaixo e escolha a alternativa correta (A, B, C, D, E).
Responda no formato exato: "Resposta: X" — sem mais texto.
Questão:
<enunciado da questão>
Alternativas:
A) ...
B) ...
C) ...
D) ...
E) ...
For the one-shot and few-shot strategies, one or two complete question–answer pairs are placed before the target question, each bracketed by Exemplo: / Resposta: X and Agora responda:.
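The layout above can be sketched as a small prompt-assembly helper. This is an illustration only: `format_question` and `build_user_prompt` are hypothetical names, and the exact placement of Exemplo:, Resposta: X, and Agora responda: is one reading of the description, not necessarily what `PromptBuilder` does.

```python
# Instruction template from prompts/default.yaml (\u2014 is the dash used there).
INSTRUCTION = (
    "Leia a questão do ENADE abaixo e escolha a alternativa correta ({valid_letters}).\n"
    'Responda no formato exato: "Resposta: X" \u2014 sem mais texto.'
)

def format_question(stem, options):
    """Render one question block; options are listed in display order."""
    letters = "ABCDE"[: len(options)]
    lines = ["Questão:", stem, "Alternativas:"]
    lines += [f"{letter}) {text}" for letter, text in zip(letters, options)]
    return "\n".join(lines)

def build_user_prompt(stem, options, examples=()):
    """examples: sequence of (stem, options, answer_letter) tuples (0, 1, or 2)."""
    valid_letters = ", ".join("ABCDE"[: len(options)])
    parts = [INSTRUCTION.replace("{valid_letters}", valid_letters)]
    for ex_stem, ex_options, ex_answer in examples:
        parts.append(
            "Exemplo:\n" + format_question(ex_stem, ex_options)
            + f"\nResposta: {ex_answer}"
        )
    if examples:
        parts.append("Agora responda:")
    parts.append(format_question(stem, options))
    return "\n".join(parts)

print(build_user_prompt("<enunciado da questão>", ["...", "...", "...", "..."]))
```

Deriving the valid letters from the number of options keeps four-option items from advertising a non-existent option E.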
| Method | Description | Recommended for |
|---|---|---|
| generation | Generates the answer letter; cascaded regex extraction | Instruction-tuned models |
| log_prob | Mean log-prob of each option's tokens given the stem | Base (non-instruct) models |
| first_token | Probability of A/B/C/D/E as the next token | Fast alternative for causal LMs |
| api | External API call; same prompt as generation | Anthropic, OpenAI, Maritaca, Google |
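To make the two probability-based methods concrete, here is a minimal sketch of the selection step only. It assumes the per-token log-probabilities and next-token logits have already been obtained from a model; the function names and input shapes are hypothetical and do not mirror `evaluator.py`.

```python
import math

def pick_by_mean_logprob(option_token_logprobs):
    """log_prob-style selection.

    option_token_logprobs: {letter: [log-prob of each token of that option]}.
    Returns the letter whose tokens have the highest mean log-probability.
    """
    means = {
        letter: sum(logprobs) / len(logprobs)
        for letter, logprobs in option_token_logprobs.items()
    }
    return max(means, key=means.get)

def first_token_probs(letter_logits):
    """first_token-style scoring.

    letter_logits: {letter: next-token logit}. Applies a softmax restricted
    to the valid letters (max-subtracted for numerical stability).
    """
    top = max(letter_logits.values())
    exps = {letter: math.exp(logit - top) for letter, logit in letter_logits.items()}
    norm = sum(exps.values())
    return {letter: value / norm for letter, value in exps.items()}

# Made-up numbers: option B is longer but more likely per token.
scores = {"A": [-1.5], "B": [-1.0, -1.0, -1.0]}
print(pick_by_mean_logprob(scores))
print(first_token_probs({"A": 2.0, "B": 0.0, "C": -1.0}))
```

Averaging rather than summing the token log-probabilities avoids penalizing an option simply for being longer, which is why the toy example selects B even though its summed log-probability is lower.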
Five regular expressions are applied in priority order. A fallback accepts the last valid letter (A–E) in the output, recovering answers from models that generate reasoning chains before the final letter. Outputs with no valid letter are marked unanswered.
Across 30,788 generation-method evaluations in this study: 93.3% matched a primary pattern, 6.4% used the fallback, 0.19% were unanswered.
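The cascade can be sketched as follows. The five patterns used by the framework are not listed in this README, so the three patterns below are illustrative stand-ins, and `extract_answer` is a hypothetical name; only the overall shape (ordered patterns, then a last-valid-letter fallback, then unanswered) follows the description above.

```python
import re

def extract_answer(text, valid_letters="ABCDE"):
    """Return (letter, source), where source is 'pattern', 'fallback',
    or 'unanswered'. Patterns are illustrative, not the framework's own."""
    letters = re.escape(valid_letters)
    # Keywords are matched case-insensitively via scoped flags; the letter
    # itself stays case-sensitive so the Portuguese article "a" is not captured.
    patterns = [
        rf"(?i:resposta)\s*:\s*\(?([{letters}])\b",   # "Resposta: C"
        rf"(?i:alternativa)\s+\(?([{letters}])\b",    # "alternativa B"
        rf"^\s*\(?([{letters}])[\).]?\s*$",           # bare "D", "(D)", "D."
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1), "pattern"

    # Fallback: last standalone valid letter, for outputs that reason first.
    standalone = re.findall(rf"\b([{letters}])\b", text)
    if standalone:
        return standalone[-1], "fallback"
    return None, "unanswered"

print(extract_answer("Resposta: C"))
print(extract_answer("Analisando as opções, a correta parece ser B."))
```

Passing `valid_letters="ABCD"` for four-option items keeps a stray "E" from being accepted as an answer.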
Configs with run_all_strategies: true automatically use --all-strategies behavior:
python run.py configs/maritaca/all_strategies.yaml --batch-submit
python run.py configs/maritaca/all_strategies.yaml --batch-fetch
python run.py configs/gemini25_flash/all_strategies.yaml --batch-submit
python run.py configs/gemini25_flash/all_strategies.yaml --batch-fetch
Each run writes to output.dir:
- {name}_results.json: per-question results with subgroup breakdowns (year, course, component)
- {name}_summary.csv: compact per-question table
Set save_intermediate: true to enable resuming interrupted runs via {name}_intermediate.jsonl.
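Resuming from an intermediate JSONL file can be sketched as below. The record layout is an assumption: each line is taken to be a JSON object with a `question_id` field, which is a hypothetical name, and the helpers `load_done_ids` and `pending` are not part of the package.

```python
import json
from pathlib import Path

def load_done_ids(path):
    """Return the set of question ids already present in an intermediate
    JSONL file. ASSUMPTION: each record carries a 'question_id' key."""
    path = Path(path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["question_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # An interrupted run can leave a truncated final line; skip it.
                continue
    return done

def pending(questions, done_ids):
    """Keep only the questions that still need to be evaluated."""
    return [q for q in questions if q["question_id"] not in done_ids]
```

Appending one JSON object per line means an interrupted run loses at most the record being written, and the reader above simply ignores that partial line.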
- 01_tables.ipynb: accuracy tables (S1–S6), per-course breakdown, temporal validity, LaTeX export
- 02_figures.ipynb: position-distribution plots, 4-panel per-experiment figures, comparison chart
Both notebooks skip result folders ending with _old automatically.
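The `_old` convention amounts to a simple name filter. A minimal sketch, assuming result folders sit directly under `results/`; `active_result_dirs` is a hypothetical helper, not the notebooks' actual code:

```python
from pathlib import Path

def active_result_dirs(root="results"):
    """Return experiment folders under root, sorted by path, skipping any
    folder whose name ends with '_old'."""
    return sorted(
        p for p in Path(root).iterdir()
        if p.is_dir() and not p.name.endswith("_old")
    )
```

Renaming a folder to, for example, `llama3_8b_old` therefore removes it from the tables and figures without deleting the underlying results.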
ENADE-QA is derived from publicly available ENADE exams (INEP/MEC). After filtering image-based and malformed questions: 716 questions, years 2013–2023, 10 programs, 687 five-option and 29 four-option items. Dataset files are included under enade_mcqa_hf/data/.
- Brown et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
- Holtzman et al. (2021). Surface Form Competition. EMNLP.
- Wang et al. (2023). Large Language Models Are Not Robust Multiple Choice Selectors. ICLR.
- Lu et al. (2022). Fantastically Ordered Prompts and Where to Find Them. ACL.
- Chen et al. (2021). Evaluating Large Language Models Trained on Code. arXiv.
Code: MIT License. Dataset: derived from public-domain ENADE exams (INEP/MEC); released under CC BY-NC 4.0.