ENADE-QA Benchmark

A reproducible evaluation framework for assessing large language models on Portuguese multiple-choice questions from ENADE (Exame Nacional de Desempenho dos Estudantes).

Overview

ENADE-QA is a benchmark of 716 unique questions drawn from health-sciences ENADE exams (2013–2023), covering ten undergraduate programs: Biomedicine, Nursing, Pharmacy, Physiotherapy, Speech-Language Pathology, General Training, Medicine, Nutrition, Dentistry, and Psychology.

The framework implements six evaluation strategies (S1–S6) combining generation-based and log-probability scoring under zero-shot, one-shot, and few-shot prompting, with and without answer-option shuffling.

Repository Structure

enade-qa-benchmark/
├── enade_benchmark/            # Python evaluation package
│   ├── dataset.py              # dataset loading and filtering
│   ├── model_loader.py         # HuggingFace model loader and API clients
│   ├── prompts.py              # PromptBuilder — reads YAML prompt files
│   ├── evaluator.py            # evaluation loop, scoring methods, answer extractor
│   ├── metrics.py              # accuracy, BPC, strategy runner
│   ├── batch.py                # Maritaca Batch API support
│   ├── batch_google.py         # Google Gemini Batch API support
│   └── io.py                   # result saving and loading
│
├── prompts/                    # editable prompt templates (no Python changes needed)
│   ├── default.yaml            # instruction, system message, labels
│   └── few_shot_examples.yaml  # examples for one-shot and few-shot strategies
│
├── configs/                    # one YAML file per experiment
│   ├── example_sabia7b.yaml    # example: local HuggingFace model
│   └── example_api_anthropic.yaml
│
├── enade_mcqa_hf/data/         # dataset files (CSV and JSONL)
│   ├── enade_mcqa_unique_clean.csv
│   └── enade_mcqa_unique_clean.jsonl
│
├── results/                    # experiment outputs (JSON + CSV per run)
├── figures/                    # generated plots
├── latex/                      # generated LaTeX tables
│
├── run.py                      # main CLI
├── 01_tables.ipynb             # analysis: accuracy tables, BPC, temporal validity
├── 02_figures.ipynb            # analysis: position plots, per-experiment panels
├── MCQA_enade_analysis.ipynb   # full analysis notebook (reference)
├── refact_dataset.ipynb        # dataset construction and deduplication
├── requirements.txt
└── setup.py

Installation

python -m venv venv_enade
source venv_enade/bin/activate

pip install -r requirements.txt
pip install -e .

# Optional: API provider SDKs
pip install anthropic openai
pip install google-generativeai google-genai

Quick Start

# Single experiment
python run.py configs/example_sabia7b.yaml

# All six evaluation strategies (generates Table 1)
python run.py configs/example_sabia7b.yaml --all-strategies

# Single strategy
python run.py configs/example_sabia7b.yaml --strategy S5

# Batch API (Maritaca / Google): asynchronous, results can take up to 24 h
python run.py configs/maritaca/all_strategies.yaml --batch-submit
python run.py configs/maritaca/all_strategies.yaml --batch-fetch

Experiment Configuration

All parameters are defined in a YAML file — no Python edits required.

Local HuggingFace Model

name: llama3_8b_zero_shot
description: "LLaMA-3.1-8B-Instruct, zero-shot generation"

dataset:
  path: enade_mcqa_hf/data/enade_mcqa_unique_clean.csv
  filter_images: true
  filter_cannot_fix: true
  anos: null        # null = all years; e.g. [2019, 2022, 2023]
  cursos: null
  componentes: null

model:
  name: "meta-llama/Llama-3.1-8B-Instruct"
  type: causal_lm   # causal_lm | seq2seq
  hf_token: null    # null = uses HF_TOKEN env variable
  load_in_4bit: false

evaluation:
  method: generation   # generation | log_prob | first_token
  shot_type: zero      # zero | one | few
  shuffle_alternatives: false
  seed: 42
  max_new_tokens: 10
  temperature: null    # null = greedy decoding

prompts:
  dir: prompts

output:
  dir: results/llama3_8b
  save_intermediate: true

API Model

name: gpt4o_mini_zero_shot

dataset:
  path: enade_mcqa_hf/data/enade_mcqa_unique_clean.csv
  filter_images: true
  filter_cannot_fix: true
  anos: null
  cursos: null
  componentes: null

model:
  type: api

api:
  provider: openai   # anthropic | openai | maritaca | google | nvidia
  model: gpt-4o-mini
  api_key: null      # null = reads from environment variable
  max_tokens: 100

evaluation:
  method: api
  shot_type: zero
  shuffle_alternatives: false
  seed: 42
  n_samples: 1
  temperature: 0.0

prompts:
  dir: prompts

output:
  dir: results/gpt4o_mini
  save_intermediate: true
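
The allowed values noted in the config comments can be checked before a run is launched. The following is a minimal sketch, assuming the YAML file has already been parsed into a dict; `validate_config` and `ALLOWED` are hypothetical names for illustration and are not part of `enade_benchmark`.

```python
# Hypothetical helper, not part of the enade_benchmark package.
# Allowed values are copied from the comments in the two configs above.
ALLOWED = {
    "method": {"generation", "log_prob", "first_token", "api"},
    "shot_type": {"zero", "one", "few"},
}


def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    evaluation = cfg.get("evaluation", {})
    for key, allowed in ALLOWED.items():
        value = evaluation.get(key)
        if value not in allowed:
            problems.append(f"evaluation.{key}={value!r} not in {sorted(allowed)}")
    if not isinstance(evaluation.get("shuffle_alternatives"), bool):
        problems.append("evaluation.shuffle_alternatives must be true or false")
    return problems


cfg = {
    "evaluation": {
        "method": "generation",
        "shot_type": "zero",
        "shuffle_alternatives": False,
        "seed": 42,
    }
}
print(validate_config(cfg))
```

Returning a list rather than raising on the first problem lets every mistake in a config be reported at once.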

API Providers

| `provider` | Example model | Environment variable |
|---|---|---|
| `anthropic` | `claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
| `openai` | `gpt-4o-mini` | `OPENAI_API_KEY` |
| `maritaca` | `sabia-3` | `MARITACA` |
| `google` | `gemini-2.5-flash` | `GEMINI` |
| `nvidia` | `deepseek-ai/deepseek-v3` | `NVIDIA_API_KEY` |
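
The `api_key: null` fallback can be pictured as a lookup from provider to environment variable. This is a sketch only: the mapping is copied from the table above, while `resolve_api_key` and its `environ` parameter are hypothetical and may not match how `model_loader.py` actually does it.

```python
import os

# Provider -> environment variable, as listed in the table above.
ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "maritaca": "MARITACA",
    "google": "GEMINI",
    "nvidia": "NVIDIA_API_KEY",
}


def resolve_api_key(provider, api_key=None, environ=None):
    """Return the explicit key if given, else the provider's environment variable."""
    if api_key is not None:
        return api_key
    environ = os.environ if environ is None else environ
    var = ENV_VARS[provider]  # raises KeyError for an unknown provider
    try:
        return environ[var]
    except KeyError:
        raise RuntimeError(f"Set {var} or fill in api.api_key") from None
```

Passing `environ` explicitly is only there to make the function easy to test without touching the real environment.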

Evaluation Strategies (S1–S6)

| Strategy | Method | Shot | Shuffle | Description |
|---|---|---|---|---|
| S1 | generation | zero | ✗ | Zero-shot generation |
| S2 | generation | one | ✗ | One-shot generation |
| S3 | generation | few | ✗ | Few-shot generation (2 examples) |
| S4 | generation | zero | ✓ | Zero-shot shuffled; measures BPC |
| S5 | log_prob | zero | ✗ | Log-probability scoring, zero-shot |
| S6 | log_prob | one | ✗ | Log-probability scoring, one-shot |
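
Each strategy is simply a combination of three `evaluation` settings. The sketch below encodes the table as data and overlays one strategy on an existing `evaluation` block; `Strategy`, `STRATEGIES`, and `apply_strategy` are hypothetical names, and the strategy runner in `metrics.py` may be organised differently.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    method: str
    shot_type: str
    shuffle: bool


# One entry per row of the table above.
STRATEGIES = {
    "S1": Strategy("generation", "zero", False),
    "S2": Strategy("generation", "one", False),
    "S3": Strategy("generation", "few", False),
    "S4": Strategy("generation", "zero", True),
    "S5": Strategy("log_prob", "zero", False),
    "S6": Strategy("log_prob", "one", False),
}


def apply_strategy(evaluation: dict, strategy_id: str) -> dict:
    """Return a copy of the evaluation block with the strategy's settings applied."""
    s = STRATEGIES[strategy_id]
    return {
        **evaluation,
        "method": s.method,
        "shot_type": s.shot_type,
        "shuffle_alternatives": s.shuffle,
    }
```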

BPC (Bias by Position of Choice; Wang et al., 2023) measures how far the distribution of predicted answer positions deviates from uniform. It is computed from S4 runs and stratified by the number of answer options.
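
The exact BPC formula is not spelled out here, so the following is only an illustration of the idea. It assumes total variation distance from the uniform distribution as the deviation measure, which may differ from what `metrics.py` computes; `position_bias` is a hypothetical name.

```python
from collections import Counter, defaultdict


def position_bias(predictions):
    """Deviation of predicted positions from uniform, per number of options.

    predictions: iterable of (n_options, predicted_position) pairs,
    with positions counted from 0. Returns {n_options: deviation in [0, 1)}.
    """
    by_size = defaultdict(Counter)
    for n_options, position in predictions:
        by_size[n_options][position] += 1

    result = {}
    for n_options, counts in by_size.items():
        total = sum(counts.values())
        uniform = 1.0 / n_options
        # Assumed measure: total variation distance (half the L1 distance).
        result[n_options] = 0.5 * sum(
            abs(counts.get(pos, 0) / total - uniform) for pos in range(n_options)
        )
    return result
```

Stratifying by `n_options` keeps four-option and five-option items apart, since their uniform baselines (1/4 and 1/5) differ.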

S5 and S6 require a local model: they score each option from token-level log-probabilities, which are not available through the API path used here.

Prompts

All prompt templates live in prompts/ and can be edited without touching Python code.

Instruction template (`generation`, `first_token`, `api` methods)

Prepended to every question. {valid_letters} is replaced by the valid answer letters (e.g., A, B, C, D, E):

Leia a questão do ENADE abaixo e escolha a alternativa correta ({valid_letters}).
Responda no formato exato: "Resposta: X" — sem mais texto.
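
The `{valid_letters}` substitution can be sketched as a plain string replacement. The template text is the one shown above; `render_instruction` is a hypothetical helper, and `PromptBuilder` may implement the substitution differently.

```python
import string

INSTRUCTION = (
    "Leia a questão do ENADE abaixo e escolha a alternativa correta ({valid_letters}).\n"
    'Responda no formato exato: "Resposta: X" — sem mais texto.'
)


def render_instruction(n_options: int) -> str:
    """Fill in the placeholder with the first n_options letters, e.g. 'A, B, C, D'."""
    letters = ", ".join(string.ascii_uppercase[:n_options])
    # str.replace avoids str.format, which would trip on any other braces
    # a user might add when editing the template.
    return INSTRUCTION.replace("{valid_letters}", letters)
```

Four-option items then receive `(A, B, C, D)` and five-option items `(A, B, C, D, E)`.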

System message (API providers only)

Sent as the system role to Anthropic, OpenAI, Maritaca, and Google APIs. {valid_letters} is replaced at runtime:

Você é um especialista em questões do ENADE (Exame Nacional de Desempenho dos Estudantes),
abrangendo ciências da saúde como medicina, enfermagem, farmácia, fisioterapia, nutrição,
odontologia, fonoaudiologia, psicologia e biomedicina, além de formação geral.
Analise cuidadosamente todas as alternativas antes de escolher.
Responda APENAS com a letra da alternativa correta ({valid_letters}).
Não inclua explicações, ponto final ou qualquer outro texto.

Few-shot examples (`prompts/few_shot_examples.yaml`)

The file holds two examples: the one-shot strategies (S2/S6) use one, and the few-shot strategy (S3) uses both. Both are drawn from publicly available ENADE general-training items:

Example 1 — Literature / Social sciences (answer: D)

Retrato de uma princesa desconhecida (Andresen, 2004)
"No poema, a autora sugere que"
→ D. o trabalho compulsório de escravos proporcionou privilégios aos príncipes.

Example 2 — Biology / Evolutionary theory (answer: C)

Asserções sobre evolução adaptativa e seleção natural
→ C. A primeira asserção é uma proposição verdadeira, e a segunda, uma proposição falsa.

The full question texts, alternatives, and correct answers are stored in prompts/few_shot_examples.yaml.

Prompt format (zero-shot generation example)

[system]: Você é um especialista em questões do ENADE ...

[user]:
Leia a questão do ENADE abaixo e escolha a alternativa correta (A, B, C, D, E).
Responda no formato exato: "Resposta: X" — sem mais texto.

Questão:
<enunciado da questão>

Alternativas:
A) ...
B) ...
C) ...
D) ...
E) ...

For one-shot and few-shot strategies, one or two complete question–answer pairs are prepended to the target question. Each pair is delimited by Exemplo: and Resposta: X, and the target question is introduced by Agora responda:.
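
The user-message layout described in this section can be assembled roughly as follows. This is a sketch under assumptions: the exact spacing and the placement of the `Exemplo:` / `Agora responda:` markers are guesses based on the description, and `format_question` and `build_prompt` are hypothetical names rather than the `PromptBuilder` API.

```python
def format_question(stem, options):
    """Render one question in the 'Questão / Alternativas' layout."""
    lines = ["Questão:", stem, "", "Alternativas:"]
    lines += [f"{letter}) {text}" for letter, text in options]
    return "\n".join(lines)


def build_prompt(instruction, target, examples=()):
    """Assemble the user message.

    target:   (stem, options), where options is a list of (letter, text)
    examples: sequence of (stem, options, answer_letter); empty for zero-shot
    """
    parts = [instruction]
    for stem, options, answer in examples:
        parts.append(
            "Exemplo:\n" + format_question(stem, options) + f"\nResposta: {answer}"
        )
    if examples:
        parts.append("Agora responda:")
    parts.append(format_question(*target))
    return "\n\n".join(parts)
```

Passing zero, one, or two entries in `examples` corresponds to the zero-shot, one-shot, and few-shot settings respectively.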

Scoring Methods

| Method | Description | Recommended for |
|---|---|---|
| `generation` | Generates the answer letter; cascaded regex extraction | Instruction-tuned models |
| `log_prob` | Mean log-prob of each option's tokens given the stem | Base (non-instruct) models |
| `first_token` | Probability of A/B/C/D/E as the next token | Fast alternative for causal LMs |
| `api` | External API call; same prompt as `generation` | Anthropic, OpenAI, Maritaca, Google |
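
The two probability-based methods differ in what they score. The sketch below works on already-computed numbers so it needs no model: `score_log_prob` picks the option with the highest mean token log-probability, and `score_first_token` turns next-token logits for the answer letters into probabilities. Both function names are hypothetical, and renormalising over only the valid letters is an assumption about `first_token`, not something stated above.

```python
import math


def score_log_prob(option_token_logprobs):
    """Pick the option whose tokens have the highest mean log-probability.

    option_token_logprobs: {letter: [log-prob of each token in that option]},
    each list non-empty.
    """
    means = {
        letter: sum(lps) / len(lps) for letter, lps in option_token_logprobs.items()
    }
    return max(means, key=means.get)


def score_first_token(letter_logits):
    """Softmax over the next-token logits of the valid answer letters."""
    peak = max(letter_logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - peak) for k, v in letter_logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


# Option B has the lower total log-prob (-4.5 vs -4.0) but the higher mean
# (-1.5 vs -2.0), so averaging avoids penalising it for being longer.
print(score_log_prob({"A": [-1.0, -3.0], "B": [-1.5, -1.5, -1.5]}))
```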

Answer Extraction

Five regular expressions are applied in priority order. A fallback accepts the last valid letter (A–E) in the output, recovering answers from models that generate reasoning chains before the final letter. Outputs with no valid letter are marked unanswered.

Across 30,788 generation-method evaluations in this study: 93.3% matched a primary pattern, 6.4% used the fallback, 0.19% were unanswered.
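
The cascade-then-fallback logic can be sketched as below. The two primary patterns are illustrative stand-ins only; the five actual expressions are defined in `evaluator.py` and are not reproduced here. `extract_answer` and its return convention are likewise hypothetical.

```python
import re

# Illustrative patterns only, tried in priority order.
PRIMARY_PATTERNS = [
    r"(?i:resposta)\s*:\s*\(?([A-E])\b",   # e.g. "Resposta: C"
    r"^\s*\(?([A-E])\)?[.)]?\s*$",         # a line holding only "C", "(C)" or "C."
]


def extract_answer(output, valid_letters="ABCDE"):
    """Return (letter, how), where how is 'primary', 'fallback' or 'unanswered'."""
    for pattern in PRIMARY_PATTERNS:
        match = re.search(pattern, output, flags=re.MULTILINE)
        if match and match.group(1) in valid_letters:
            return match.group(1), "primary"

    # Fallback: the last standalone valid letter, which recovers answers
    # that come at the end of a reasoning chain.
    letters = [c for c in re.findall(r"\b([A-E])\b", output) if c in valid_letters]
    if letters:
        return letters[-1], "fallback"
    return None, "unanswered"
```

Passing `valid_letters="ABCD"` for four-option items keeps an out-of-range `E` from being counted as an answer.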

Batch API

Configs that set run_all_strategies: true behave as if --all-strategies had been passed:

python run.py configs/maritaca/all_strategies.yaml --batch-submit
python run.py configs/maritaca/all_strategies.yaml --batch-fetch

python run.py configs/gemini25_flash/all_strategies.yaml --batch-submit
python run.py configs/gemini25_flash/all_strategies.yaml --batch-fetch

Result Files

Each run writes to output.dir:

{name}_results.json — per-question results with subgroup breakdowns (year, course, component)
{name}_summary.csv — compact per-question table

Set save_intermediate: true to enable resuming interrupted runs via {name}_intermediate.jsonl.
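
Resuming from a JSONL file usually amounts to collecting the ids already written and skipping them. The sketch below assumes one JSON object per line with a `question_id` field; that field name is a guess, and `load_completed_ids` is a hypothetical helper rather than a function from `io.py`.

```python
import json
from pathlib import Path


def load_completed_ids(path, id_field="question_id"):
    """Collect the ids already present in an intermediate JSONL file."""
    path = Path(path)
    if not path.exists():
        return set()  # nothing to resume from

    done = set()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)[id_field])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A run killed mid-write can leave a truncated last line; skip it.
                continue
    return done
```

A resumed run would then evaluate only the questions whose id is not in the returned set.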

Analysis Notebooks

01_tables.ipynb — accuracy tables (S1–S6), per-course breakdown, temporal validity, LaTeX export
02_figures.ipynb — position-distribution plots, 4-panel per-experiment figures, comparison chart

Both notebooks skip result folders ending with _old automatically.
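
The `_old` convention can be reproduced outside the notebooks with a short filter. This is a sketch; `result_dirs` is a hypothetical name and the notebooks may discover folders differently.

```python
from pathlib import Path


def result_dirs(root="results"):
    """List result folders under root, skipping any whose name ends with '_old'."""
    return sorted(
        p for p in Path(root).iterdir()
        if p.is_dir() and not p.name.endswith("_old")
    )
```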

Dataset

ENADE-QA is derived from publicly available ENADE exams (INEP/MEC). After image-based and malformed questions are filtered out, 716 questions remain, spanning 2013–2023 and 10 programs: 687 five-option and 29 four-option items. Dataset files are included under enade_mcqa_hf/data/.

References

Brown et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
Holtzman et al. (2021). Surface Form Competition. EMNLP.
Wang et al. (2023). Large Language Models Are Not Robust Multiple Choice Selectors. ICLR.
Lu et al. (2022). Fantastically Ordered Prompts and Where to Find Them. ACL.
Chen et al. (2021). Evaluating Large Language Models Trained on Code. arXiv.

License

Code: MIT License. Dataset: derived from public-domain ENADE exams (INEP/MEC); released under CC BY-NC 4.0.
