H2IA/ENADE-QA

ENADE-QA Benchmark

A reproducible evaluation framework for assessing large language models on Portuguese multiple-choice questions from ENADE (Exame Nacional de Desempenho dos Estudantes).

Overview

ENADE-QA is a benchmark of 716 unique questions drawn from health-sciences ENADE exams (2013–2023). It covers ten course categories: nine undergraduate programs (Biomedicine, Nursing, Pharmacy, Physiotherapy, Speech-Language Pathology, Medicine, Nutrition, Dentistry, and Psychology) plus the General Training (Formação Geral) component.

The framework implements six evaluation strategies (S1–S6) combining generation-based and log-probability scoring under zero-shot, one-shot, and few-shot prompting, with and without answer-option shuffling.

Repository Structure

enade-qa-benchmark/
├── enade_benchmark/            # Python evaluation package
│   ├── dataset.py              # dataset loading and filtering
│   ├── model_loader.py         # HuggingFace model loader and API clients
│   ├── prompts.py              # PromptBuilder — reads YAML prompt files
│   ├── evaluator.py            # evaluation loop, scoring methods, answer extractor
│   ├── metrics.py              # accuracy, BPC, strategy runner
│   ├── batch.py                # Maritaca Batch API support
│   ├── batch_google.py         # Google Gemini Batch API support
│   └── io.py                   # result saving and loading
│
├── prompts/                    # editable prompt templates (no Python changes needed)
│   ├── default.yaml            # instruction, system message, labels
│   └── few_shot_examples.yaml  # examples for one-shot and few-shot strategies
│
├── configs/                    # one YAML file per experiment
│   ├── example_sabia7b.yaml    # example: local HuggingFace model
│   └── example_api_anthropic.yaml
│
├── enade_mcqa_hf/data/         # dataset files (CSV and JSONL)
│   ├── enade_mcqa_unique_clean.csv
│   └── enade_mcqa_unique_clean.jsonl
│
├── results/                    # experiment outputs (JSON + CSV per run)
├── figures/                    # generated plots
├── latex/                      # generated LaTeX tables
│
├── run.py                      # main CLI
├── 01_tables.ipynb             # analysis: accuracy tables, BPC, temporal validity
├── 02_figures.ipynb            # analysis: position plots, per-experiment panels
├── MCQA_enade_analysis.ipynb   # full analysis notebook (reference)
├── refact_dataset.ipynb        # dataset construction and deduplication
├── requirements.txt
└── setup.py

Installation

python -m venv venv_enade
source venv_enade/bin/activate

pip install -r requirements.txt
pip install -e .

# Optional: API provider SDKs
pip install anthropic openai
pip install google-generativeai google-genai

Quick Start

# Single experiment
python run.py configs/example_sabia7b.yaml

# All six evaluation strategies (generates Table 1)
python run.py configs/example_sabia7b.yaml --all-strategies

# Single strategy
python run.py configs/example_sabia7b.yaml --strategy S5

# Batch API (Maritaca / Google, async up to 24h)
python run.py configs/maritaca/all_strategies.yaml --batch-submit
python run.py configs/maritaca/all_strategies.yaml --batch-fetch

Experiment Configuration

All parameters are defined in a YAML file — no Python edits required.

Local HuggingFace Model

name: llama3_8b_zero_shot
description: "LLaMA-3.1-8B-Instruct, zero-shot generation"

dataset:
  path: enade_mcqa_hf/data/enade_mcqa_unique_clean.csv
  filter_images: true
  filter_cannot_fix: true
  anos: null        # null = all years; e.g. [2019, 2022, 2023]
  cursos: null
  componentes: null

model:
  name: "meta-llama/Llama-3.1-8B-Instruct"
  type: causal_lm   # causal_lm | seq2seq
  hf_token: null    # null = uses HF_TOKEN env variable
  load_in_4bit: false

evaluation:
  method: generation   # generation | log_prob | first_token
  shot_type: zero      # zero | one | few
  shuffle_alternatives: false
  seed: 42
  max_new_tokens: 10
  temperature: null    # null = greedy decoding

prompts:
  dir: prompts

output:
  dir: results/llama3_8b
  save_intermediate: true
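
The dataset filters in this config can be read as follows: null disables a filter, while a list keeps only matching rows. Below is a minimal sketch of that logic, not the code in dataset.py. It assumes rows are plain dicts, and the column names ano, curso, componente, and has_image are hypothetical.

```python
from typing import Iterable, Optional


def filter_rows(
    rows: Iterable[dict],
    anos: Optional[list] = None,
    cursos: Optional[list] = None,
    componentes: Optional[list] = None,
    filter_images: bool = True,
) -> list:
    """Keep rows that pass every active filter; None means 'no restriction'."""
    kept = []
    for row in rows:
        if filter_images and row.get("has_image"):  # hypothetical column name
            continue
        if anos is not None and row["ano"] not in anos:
            continue
        if cursos is not None and row["curso"] not in cursos:
            continue
        if componentes is not None and row["componente"] not in componentes:
            continue
        kept.append(row)
    return kept


rows = [
    {"ano": 2019, "curso": "Medicina", "componente": "especifico", "has_image": False},
    {"ano": 2013, "curso": "Nutrição", "componente": "geral", "has_image": False},
    {"ano": 2023, "curso": "Medicina", "componente": "especifico", "has_image": True},
]
print(len(filter_rows(rows)))                     # the image-based row is dropped
print(len(filter_rows(rows, anos=[2019, 2022])))  # only the 2019 row remains
```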

API Model

name: gpt4o_mini_zero_shot

dataset:
  path: enade_mcqa_hf/data/enade_mcqa_unique_clean.csv
  filter_images: true
  filter_cannot_fix: true
  anos: null
  cursos: null
  componentes: null

model:
  type: api

api:
  provider: openai   # anthropic | openai | maritaca | google | nvidia
  model: gpt-4o-mini
  api_key: null      # null = reads from environment variable
  max_tokens: 100

evaluation:
  method: api
  shot_type: zero
  shuffle_alternatives: false
  seed: 42
  n_samples: 1
  temperature: 0.0

prompts:
  dir: prompts

output:
  dir: results/gpt4o_mini
  save_intermediate: true
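
The shuffle_alternatives and seed settings imply a reproducible permutation of the answer options, after which the gold letter has to be remapped. The sketch below shows one way this could work. The function name, the question_id argument, and the per-question seeding scheme are assumptions, not the framework's actual code.

```python
import random
import string


def shuffle_alternatives(options: list, gold_index: int, seed: int, question_id: int):
    """Shuffle option texts reproducibly; return (shuffled options, new gold letter)."""
    # Derive a per-question generator so every item gets its own repeatable order.
    rng = random.Random(seed * 1_000_003 + question_id)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_gold = order.index(gold_index)  # position the correct option moved to
    return shuffled, string.ascii_uppercase[new_gold]


options = ["texto 1", "texto 2", "texto 3", "texto 4", "texto 5"]
shuffled, gold_letter = shuffle_alternatives(options, gold_index=2, seed=42, question_id=7)
print(shuffled, gold_letter)
```

Remapping the gold letter alongside the options keeps accuracy comparable between shuffled and unshuffled runs.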

API Providers

provider    Example model              Environment variable
anthropic   claude-sonnet-4-6          ANTHROPIC_API_KEY
openai      gpt-4o-mini                OPENAI_API_KEY
maritaca    sabia-3                    MARITACA
google      gemini-2.5-flash           GEMINI
nvidia      deepseek-ai/deepseek-v3    NVIDIA_API_KEY
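
With api_key: null in the config, the key is read from the provider's environment variable. A minimal sketch of that lookup follows, using the variable names from the table above. The function name and the error message are illustrative, not taken from model_loader.py.

```python
import os
from typing import Optional

# Environment variable per provider, as listed in the table above.
ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "maritaca": "MARITACA",
    "google": "GEMINI",
    "nvidia": "NVIDIA_API_KEY",
}


def resolve_api_key(provider: str, api_key: Optional[str] = None) -> str:
    """Prefer an explicit key from the config, else fall back to the environment."""
    if api_key:
        return api_key
    env_name = ENV_VARS[provider]
    value = os.environ.get(env_name)
    if not value:
        raise RuntimeError(f"Set {env_name} or provide api.api_key in the config")
    return value
```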

Evaluation Strategies (S1–S6)

#    Method      Shot  Shuffle  Description
S1   generation  zero  no       Zero-shot generation
S2   generation  one   no       One-shot generation
S3   generation  few   no       Few-shot generation (2 examples)
S4   generation  zero  yes      Zero-shot shuffled; measures BPC
S5   log_prob    zero  no       Log-probability scoring, zero-shot
S6   log_prob    one   no       Log-probability scoring, one-shot
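
Each strategy can be seen as a set of overrides for the evaluation block of a config. The sketch below encodes the table that way. It assumes shuffling is enabled only for S4, and apply_strategy is a hypothetical helper, not part of run.py.

```python
# Evaluation settings per strategy, mirroring the table above.
STRATEGIES = {
    "S1": {"method": "generation", "shot_type": "zero", "shuffle_alternatives": False},
    "S2": {"method": "generation", "shot_type": "one", "shuffle_alternatives": False},
    "S3": {"method": "generation", "shot_type": "few", "shuffle_alternatives": False},
    "S4": {"method": "generation", "shot_type": "zero", "shuffle_alternatives": True},
    "S5": {"method": "log_prob", "shot_type": "zero", "shuffle_alternatives": False},
    "S6": {"method": "log_prob", "shot_type": "one", "shuffle_alternatives": False},
}


def apply_strategy(config: dict, strategy: str) -> dict:
    """Return a copy of config whose evaluation block is overridden by the strategy."""
    merged = dict(config)
    merged["evaluation"] = {**config.get("evaluation", {}), **STRATEGIES[strategy]}
    return merged


base = {"name": "demo", "evaluation": {"method": "generation", "seed": 42}}
print(apply_strategy(base, "S5")["evaluation"])
```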

BPC (Bias by Position of Choice, Wang et al. 2023) measures how far the distribution of predicted answer positions deviates from uniform. It is computed from S4 runs and stratified by the number of answer options.
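
The exact deviation measure is not spelled out here, so the sketch below uses total variation distance from the uniform distribution as one plausible stand-in. It should not be read as the formula in metrics.py. Predictions are assumed to be (number of options, 0-based predicted position) pairs.

```python
from collections import Counter, defaultdict


def position_bias(predictions: list) -> dict:
    """Total variation distance from uniform, stratified by number of options."""
    by_size = defaultdict(list)
    for n_options, position in predictions:
        by_size[n_options].append(position)

    result = {}
    for n_options, positions in by_size.items():
        counts = Counter(positions)
        total = len(positions)
        uniform = 1.0 / n_options
        # 0.0 means perfectly uniform; larger values mean stronger position bias.
        result[n_options] = 0.5 * sum(
            abs(counts.get(p, 0) / total - uniform) for p in range(n_options)
        )
    return result


balanced = [(4, 0), (4, 1), (4, 2), (4, 3)]
always_first = [(5, 0)] * 10
print(position_bias(balanced + always_first))
```

Stratifying matters because the uniform baseline is 1/4 for four-option items and 1/5 for five-option items.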

S5 and S6 require a local model, because log-probability scoring needs per-token log-probabilities that the API path does not provide.

Prompts

All prompt templates live in prompts/ and can be edited without touching Python code.

Instruction template (generation, first_token, api methods)

Prepended to every question. {valid_letters} is replaced by the valid answer letters (e.g., A, B, C, D, E):

Leia a questão do ENADE abaixo e escolha a alternativa correta ({valid_letters}).
Responda no formato exato: "Resposta: X" — sem mais texto.
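
Because the dataset mixes five-option and four-option items, the letters inserted for {valid_letters} depend on the question. A small sketch of the substitution follows, using only the first line of the template. The helper name render_instruction is an assumption.

```python
# First line of the instruction template shown above.
INSTRUCTION_HEAD = (
    "Leia a questão do ENADE abaixo e escolha a alternativa correta ({valid_letters})."
)


def render_instruction(n_options: int) -> str:
    """Fill {valid_letters} with the first n_options capital letters."""
    letters = [chr(ord("A") + i) for i in range(n_options)]
    return INSTRUCTION_HEAD.format(valid_letters=", ".join(letters))


print(render_instruction(5))
print(render_instruction(4))
```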

System message (API providers only)

Sent as the system role to Anthropic, OpenAI, Maritaca, and Google APIs. {valid_letters} is replaced at runtime:

Você é um especialista em questões do ENADE (Exame Nacional de Desempenho dos Estudantes),
abrangendo ciências da saúde como medicina, enfermagem, farmácia, fisioterapia, nutrição,
odontologia, fonoaudiologia, psicologia e biomedicina, além de formação geral.
Analise cuidadosamente todas as alternativas antes de escolher.
Responda APENAS com a letra da alternativa correta ({valid_letters}).
Não inclua explicações, ponto final ou qualquer outro texto.

Few-shot examples (prompts/few_shot_examples.yaml)

The file provides two examples: the one-shot strategies (S2/S6) prepend one of them and the few-shot strategy (S3) prepends both. Both are drawn from publicly available ENADE general-training items:

Example 1 — Literature / Social sciences (answer: D)

Retrato de uma princesa desconhecida (Andresen, 2004)
"No poema, a autora sugere que"
D. o trabalho compulsório de escravos proporcionou privilégios aos príncipes.

Example 2 — Biology / Evolutionary theory (answer: C)

Asserções sobre evolução adaptativa e seleção natural
C. A primeira asserção é uma proposição verdadeira, e a segunda, uma proposição falsa.

The full question texts, alternatives, and correct answers are stored in prompts/few_shot_examples.yaml.

Prompt format (zero-shot generation example)

[system]: Você é um especialista em questões do ENADE ...

[user]:
Leia a questão do ENADE abaixo e escolha a alternativa correta (A, B, C, D, E).
Responda no formato exato: "Resposta: X" — sem mais texto.

Questão:
<enunciado da questão>

Alternativas:
A) ...
B) ...
C) ...
D) ...
E) ...

For one-shot and few-shot strategies, one or two complete question–answer pairs are prepended before the target question. Each example opens with Exemplo: and closes with Resposta: X, and the target question is introduced by Agora responda:.
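
The layout described above can be sketched as a small prompt assembler. The exact spacing and the helper names format_question and build_user_prompt are assumptions. The actual PromptBuilder in prompts.py reads its labels from the YAML files.

```python
def format_question(stem: str, options: list) -> str:
    """Render a stem and its options in the Questão/Alternativas layout."""
    lines = ["Questão:", stem, "", "Alternativas:"]
    lines += [f"{chr(ord('A') + i)}) {text}" for i, text in enumerate(options)]
    return "\n".join(lines)


def build_user_prompt(instruction: str, target: tuple, examples=()) -> str:
    """target is (stem, options); examples is a sequence of (stem, options, answer)."""
    parts = [instruction]
    for stem, options, answer in examples:
        parts.append("Exemplo:\n" + format_question(stem, options) + f"\nResposta: {answer}")
    if examples:
        parts.append("Agora responda:")  # marks the start of the target question
    parts.append(format_question(*target))
    return "\n\n".join(parts)


example = ("Enunciado de exemplo", ["opção 1", "opção 2"], "B")
print(build_user_prompt("Instrução...", ("Enunciado alvo", ["x", "y"]), [example]))
```

Passing no examples yields the zero-shot prompt, one example gives the one-shot form, and two give the few-shot form.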

Scoring Methods

Method       Description                                              Recommended for
generation   Generates the answer letter; cascaded regex extraction   Instruction-tuned models
log_prob     Mean log-prob of each option's tokens given the stem     Base (non-instruct) models
first_token  Probability of A/B/C/D/E as the next token               Fast alternative for causal LMs
api          External API call; same prompt as generation             Anthropic, OpenAI, Maritaca, Google
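
To illustrate the log_prob and first_token rows, the sketch below works on precomputed numbers rather than a real model. The input shapes are assumed: per-option token log-probabilities and per-letter logits. Renormalizing over the valid letters in first_token_distribution is also an assumption.

```python
import math


def pick_by_mean_logprob(option_token_logprobs: dict) -> str:
    """Choose the option whose tokens have the highest mean log-probability."""
    means = {
        letter: sum(lps) / len(lps)
        for letter, lps in option_token_logprobs.items()
    }
    return max(means, key=means.get)


def first_token_distribution(letter_logits: dict) -> dict:
    """Softmax restricted to the logits of the valid answer letters."""
    peak = max(letter_logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - peak) for k, v in letter_logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


# Option B is longer, so its summed log-prob (-2.4) is lower than A's (-2.0),
# but its per-token mean (-0.6) is higher than A's (-1.0).
scores = {"A": [-1.0, -1.0], "B": [-0.6, -0.6, -0.6, -0.6]}
print(pick_by_mean_logprob(scores))
print(first_token_distribution({"A": 2.0, "B": 1.0, "C": 0.5, "D": 0.1}))
```

Averaging instead of summing avoids penalizing longer answer options merely for having more tokens.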

Answer Extraction

Five regular expressions are applied in priority order. A fallback accepts the last valid letter (A–E) in the output, recovering answers from models that generate reasoning chains before the final letter. Outputs with no valid letter are marked unanswered.
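
A cascade of this kind can be sketched as below. The two patterns are illustrative only. The five expressions used in evaluator.py are not listed in this README, so none of these should be read as the repository's actual regexes.

```python
import re
from typing import Optional

# Illustrative primary patterns, tried in priority order.
PATTERNS = [
    re.compile(r"Resposta\s*:\s*\(?([A-E])\b", re.IGNORECASE),  # "Resposta: C"
    re.compile(r"^\s*\(?([A-E])\)?[\s.)]*$", re.IGNORECASE),    # bare "C", "(c)", "C."
]


def extract_answer(output: str, valid_letters: str = "ABCDE") -> Optional[str]:
    """Return the answer letter, or None when the output is unanswered."""
    for pattern in PATTERNS:  # first matching pattern wins
        match = pattern.search(output)
        if match and match.group(1).upper() in valid_letters:
            return match.group(1).upper()
    # Fallback: last standalone capital letter that is a valid option.
    standalone = re.findall(r"\b([A-E])\b", output)
    candidates = [c for c in standalone if c in valid_letters]
    return candidates[-1] if candidates else None


print(extract_answer("Resposta: C"))
print(extract_answer("A opção B está errada, logo a correta é D."))
print(extract_answer("Não sei."))
```

The fallback matches capitals only, since lowercase "a" and "e" are common Portuguese words and would otherwise be mistaken for answers.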

Across 30,788 generation-method evaluations in this study: 93.3% matched a primary pattern, 6.4% used the fallback, 0.19% were unanswered.

Batch API

Configs with run_all_strategies: true automatically use --all-strategies behavior:

python run.py configs/maritaca/all_strategies.yaml --batch-submit
python run.py configs/maritaca/all_strategies.yaml --batch-fetch

python run.py configs/gemini25_flash/all_strategies.yaml --batch-submit
python run.py configs/gemini25_flash/all_strategies.yaml --batch-fetch

Result Files

Each run writes to output.dir:

  • {name}_results.json — per-question results with subgroup breakdowns (year, course, component)
  • {name}_summary.csv — compact per-question table

Set save_intermediate: true to enable resuming interrupted runs via {name}_intermediate.jsonl.
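
Resuming from a JSONL file usually means skipping questions whose records are already on disk and appending the rest. A minimal sketch under that assumption follows. The field name question_id and both function names are hypothetical, and the framework's record format may differ.

```python
import json
from pathlib import Path


def load_done_ids(path: Path) -> set:
    """Collect the question ids already recorded in the intermediate file."""
    done = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                done.add(json.loads(line)["question_id"])  # hypothetical field name
    return done


def run_with_resume(questions, evaluate, path: Path) -> None:
    """Evaluate only questions missing from path, appending one JSON line each."""
    done = load_done_ids(path)
    with path.open("a", encoding="utf-8") as handle:
        for question in questions:
            if question["question_id"] in done:
                continue  # finished in an earlier run
            record = {
                "question_id": question["question_id"],
                "prediction": evaluate(question),
            }
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            handle.flush()  # push each record out so an interruption loses little
```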

Analysis Notebooks

  • 01_tables.ipynb — accuracy tables (S1–S6), per-course breakdown, temporal validity, LaTeX export
  • 02_figures.ipynb — position-distribution plots, 4-panel per-experiment figures, comparison chart

Both notebooks skip result folders ending with _old automatically.
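
The _old convention gives a simple way to archive a run without deleting it. A sketch of how such a filter might look follows. The function name is an assumption, and the notebooks may implement it differently.

```python
from pathlib import Path


def active_result_dirs(results_root: Path) -> list:
    """List experiment folders under results_root, ignoring archived *_old ones."""
    return sorted(
        entry
        for entry in results_root.iterdir()
        if entry.is_dir() and not entry.name.endswith("_old")
    )
```

Under this sketch, renaming results/llama3_8b to results/llama3_8b_old would drop that run from the listing.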

Dataset

ENADE-QA is derived from publicly available ENADE exams (INEP/MEC). After filtering out image-based and malformed questions, the dataset contains 716 questions from 2013–2023 across 10 programs: 687 five-option items and 29 four-option items. Dataset files are included under enade_mcqa_hf/data/.

References

  • Brown et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
  • Holtzman et al. (2021). Surface Form Competition. EMNLP.
  • Wang et al. (2023). Large Language Models Are Not Robust Multiple Choice Selectors. ICLR.
  • Lu et al. (2022). Fantastically Ordered Prompts and Where to Find Them. ACL.
  • Chen et al. (2021). Evaluating Large Language Models Trained on Code. arXiv.

License

Code: MIT License. Dataset: derived from public-domain ENADE exams (INEP/MEC); released under CC BY-NC 4.0.

About

ENADE-QA: A Brazilian Portuguese Dataset for Benchmarking Large Language Models on Health-Related Higher Education Exams
