This repository provides code for reproducing the experiments from the paper. It is organized into two main sections, with the relevant documentation linked below:
- Controlled experiments:
  - Generate synthetic data
  - Run controlled experiments (this README)
- Natural experiments:

The recommended setup is via the pre-built Docker image, available on DockerHub:
docker pull anjasur/factgap:amd64-cuda-root-latest

See the image page on DockerHub: anjasur/factgap.
If you would rather build the image yourself, see the Dockerfile in the installation/ folder.
This README outlines how to run the controlled experiments. All three main scripts (training, inference and grading) are driven by Hydra configs from
src/origins/configs/. The config is selected with --config-name and any field can be overridden from the command line.
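To illustrate what a dotted command-line override does to a nested config, here is a minimal pure-Python sketch. It is not Hydra's implementation: `apply_override` is a hypothetical helper, and unlike Hydra it keeps every value as a string instead of parsing types.

```python
from typing import Any


def apply_override(cfg: dict[str, Any], override: str) -> None:
    """Set one `a.b.c=value` override in a nested dict, in place (hypothetical helper)."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        # Create intermediate levels on demand (a simplification of real config handling).
        node = node.setdefault(name, {})
    node[leaf] = raw_value  # kept as a string; no type parsing is attempted


cfg = {"infer": {"checkpoint_params": {"start_epoch": 0, "end_epoch": 20}}}
apply_override(cfg, "infer.checkpoint_params.start_epoch=5")
apply_override(cfg, "infer.checkpoint_params.checkpoint_path=outputs/v12_gemma3_4b/lr1e-5")
```

Only the addressed leaves change; sibling fields such as `end_epoch` keep their previous values.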
| Script | Purpose | Backend | Stage config |
|---|---|---|---|
| src/origins/main.py | Training and loss computation | transformers + accelerate (DeepSpeed) | train.yaml |
| src/origins/main_vllm.py | Fast inference over one or more checkpoints | vLLM | inference.yaml |
| src/origins/main_grading.py | Grade model outputs using external LLM APIs | OpenAI / Gemini APIs | grade.yaml |

The three stage configs use default_config.yaml as a base and override the relevant flags for the chosen stage.
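The base-plus-overrides relationship can be pictured as a recursive dictionary merge. The sketch below is an assumption about the general idea, not the repository's or Hydra's actual composition code; `deep_merge` and the config excerpts are hypothetical.

```python
import copy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge field by field.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars (or mismatched types) are replaced outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical excerpt of default_config.yaml; the real defaults may differ.
default_cfg = {
    "train": {"enable_training": False, "learning_rate": 5e-6},
    "infer": {"enable_inference": False, "temperature": 0.3},
}
# Hypothetical excerpt of a stage config that only flips its stage flag.
inference_overrides = {"infer": {"enable_inference": True}}

inference_cfg = deep_merge(default_cfg, inference_overrides)
```

Fields the stage config does not mention (here `infer.temperature` and the whole `train` block) carry over from the base unchanged.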
accelerate launch --config_file accelerate_configs/train_launch.yaml \
src/origins/main.py --config-name train

Point infer.checkpoint_params.checkpoint_path at the directory of a previous training run. With default settings this is outputs/debug/experiment_default_train (see the Outputs section for how this path is constructed):

python3 src/origins/main_vllm.py --config-name inference \
infer.checkpoint_params.checkpoint_path=outputs/debug/experiment_default_train

Point grading.results_dir at the directory of a previous inference run. With default settings this is outputs/debug/experiment_default_infer:

python3 src/origins/main_grading.py --config-name grade \
grading.results_dir=outputs/debug/experiment_default_infer

Any Hydra override is simply appended to the command:
python3 src/origins/main_vllm.py --config-name inference \
infer.checkpoint_params.checkpoint_path=outputs/v12_gemma3_4b/lr1e-5 \
infer.checkpoint_params.start_epoch=5 \
infer.checkpoint_params.end_epoch=20

Every run of main.py, main_vllm.py, or main_grading.py creates a results directory under outputs/. Each invocation creates its own timestamped subdirectory under hydra.run.dir, where results, logs and a Hydra config snapshot are written.
Override output.experiment_group and output.experiment_name to control where results land. The sweep configs generated by src/origins/prepare_sweeps/prepare_experiments.py set these automatically per grid point.
Each stage (training, inference, grading) has its own stage config in src/origins/configs/ (train.yaml, inference.yaml, grade.yaml) that:
- inherits the full schema from default_config.yaml, and
- overrides the fields needed for that stage, at minimum the stage flag (train.enable_training, infer.enable_inference, or grading.enable_grading), so exactly one stage is active per run.

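The "exactly one stage is active" rule can be checked mechanically. The following is a minimal sketch on a plain dictionary; whether the scripts themselves enforce this is not stated here, and `check_single_stage` is a hypothetical helper.

```python
from typing import Any

# The three stage flags named above, as (section, flag) pairs.
STAGE_FLAGS = (
    ("train", "enable_training"),
    ("infer", "enable_inference"),
    ("grading", "enable_grading"),
)


def active_stages(cfg: dict[str, Any]) -> list[str]:
    """List the dotted names of all stage flags that are switched on."""
    return [
        f"{section}.{flag}"
        for section, flag in STAGE_FLAGS
        if (cfg.get(section) or {}).get(flag, False)
    ]


def check_single_stage(cfg: dict[str, Any]) -> str:
    """Return the single active stage flag, or raise if zero or several are set."""
    enabled = active_stages(cfg)
    if len(enabled) != 1:
        raise ValueError(f"expected exactly one active stage, found: {enabled}")
    return enabled[0]


inference_cfg = {
    "train": {"enable_training": False},
    "infer": {"enable_inference": True},
}
stage = check_single_stage(inference_cfg)
```

A missing section or flag is treated as disabled, which matches a config where the stage file only overrides its own flag.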
The YAML blocks below show, for each stage, the overrides you typically need on top of default_config.yaml. Anything not listed inherits its default from default_config.yaml.
--config-name train
Overrides on top of default_config.yaml for a training-only run:
train:
  enable_training: true               # stage flag: must be true for training
  gradient_checkpointing: false       # set to true if running out of GPU memory
  num_train_epochs: 20                # number of epochs to train for
  learning_rate: 5e-6
  per_device_train_batch_size: 16
  gradient_accumulation_steps: 1      # needs to match the value in the accelerate config

--config-name inference
Overrides on top of default_config.yaml for an inference-only run:
train:
  enable_training: false              # stage flag: disable training
infer:
  enable_inference: true              # stage flag: must be true for inference
  temperature: 0.3
  num_return_sequences: 1
  checkpoint_params:
    # Set to outputs/<experiment_group>/<run> from a training job.
    # main_vllm.py loads <checkpoint_path>/model_after_epoch_<i> for each i in
    # [start_epoch, end_epoch] (step `frequency`).
    checkpoint_path: null
    start_epoch: 0
    end_epoch: 20
    frequency: 1

--config-name grade
main_grading.py reads model outputs produced by the inference stage and sends them to an external LLM to be graded.
Create an api_keys.yaml file (path configurable via grading.api_key_file_path). Each top-level key must be one of the provider strings defined by the ProviderAPI enum in src/origins/models_api/adapters.py. main_grading.py infers the provider from
grading.model (or uses grading.provider if set) and looks up the matching key here:
openai: "sk-..."
google: "AIza-..."

Overrides on top of default_config.yaml for a grading-only run:
train:
  enable_training: false              # stage flag: disable training
grading:
  enable_grading: true                # stage flag: must be true for grading
  results_dir: outputs/<experiment_group>/<run>  # set to results output folder produced by main_vllm.py
  start_epoch: 0                      # inclusive
  end_epoch: 20                       # inclusive
  # Grader model: provider is auto-inferred from the model name; set `provider` to override.
  model: gemini-3.1-flash-lite
  provider: null
  api_key_file_path: "api_keys.yaml"
  # Generation parameters forwarded to the grader LM.
  temperature: 0.3
  max_tokens: 512
  reasoning_effort: minimal           # none / minimal / low / medium / high
  # Throughput / robustness.
  max_concurrent_requests: 10
  max_retries: 5
  skip_model_lookup: false
  # Jinja2 template for the grader prompt.
  grading_template_name: "api_grading.j2"
  # Quick-check subsets.
  debug_mode: false
  debug_num_samples: 1000
  debug_num_unique_tasks: 50

After running the experiments, you can plot the results (generation and verification accuracy) using src/origins/notebooks/plot_results.ipynb.
The notebook contains one section per phase (acquisition / update / continual learning).
The experiments/ directory contains the full set of pre-generated configs and launch scripts used in the paper, organised by phase and model. The three phases map to the following directory prefixes:
| Paper phase | Directory prefix | Example |
|---|---|---|
| Acquisition | learn_* | experiments/learn_gemma3_4b/ |
| Update | update_* | experiments/update_gemma3_4b/ |
| Continual learning (forget) | forget_* | experiments/forget_gemma3_4b_epoch6/ |

forget_* (continual-learning) recipes carry an extra _epoch<N> suffix that controls when continual training is launched relative to the acquisition run: epoch<N> means the continual phase starts N epochs after both capabilities — generation and verification — have saturated during the acquisition run. So forget_gemma3_4b_epoch0 resumes from the saturation epoch itself, forget_gemma3_4b_epoch6 resumes 6 epochs later, etc.
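The arithmetic behind the _epoch<N> suffix can be written down directly. This sketch assumes the model_after_epoch_<i> checkpoint naming used by the inference stage; the saturation epoch in the example is a made-up value, not a result from the paper, and `resume_checkpoint_name` is a hypothetical helper.

```python
def resume_checkpoint_name(saturation_epoch: int, epochs_after_saturation: int) -> str:
    """Name of the acquisition checkpoint a forget_*_epoch<N> recipe resumes from."""
    if saturation_epoch < 0 or epochs_after_saturation < 0:
        raise ValueError("epochs must be non-negative")
    # Continual training starts N epochs after both capabilities have saturated.
    return f"model_after_epoch_{saturation_epoch + epochs_after_saturation}"


# Hypothetical saturation epoch of 3, for illustration only.
at_saturation = resume_checkpoint_name(saturation_epoch=3, epochs_after_saturation=0)
six_later = resume_checkpoint_name(saturation_epoch=3, epochs_after_saturation=6)
```

With these illustrative numbers, an _epoch0 recipe would resume from model_after_epoch_3 and an _epoch6 recipe from model_after_epoch_9.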
Continual learning data prerequisite. Before running any forget_* recipe, prepare the Wikipedia T-REx dataset by executing all cells of src/origins/notebooks/prepare_continual_training_data.ipynb. The notebook downloads wiki data triplets and writes the resulting files to data/control_wiki/, from which the continual-learning training experiments load their data.
Each parent phase directory in experiments/ bundles four sub-stages, distinguished by suffix:
| Suffix | Stage | Entry point |
|---|---|---|
| _train | Training (acquire / update / forget) | src/origins/main.py (Accelerate + DeepSpeed) |
| _infer | Per-checkpoint vLLM inference | src/origins/main_vllm.py |
| _grade | API-based grading | src/origins/main_grading.py |
| _val_target | Loss calculation | src/origins/main.py |

The parent directory (e.g. experiments/learn_gemma3_4b/) chains _train → _infer in a single all_jobs.sh; the _grade and _val_target stages are launched separately from their own all_jobs.sh.
Models covered: gemma3_{4b,12b}, llama32_{3b,11b}, qwen3_{4b,14b}, msft4_{4b,14b} (Microsoft Phi-4 models).
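The directory naming described above (phase prefix, model, optional _epoch<N>, stage suffix) can be composed programmatically. This is a sketch of the naming convention only; `experiment_dir` is a hypothetical helper and does not check that a given directory actually exists in experiments/.

```python
from typing import Optional

PHASES = ("learn", "update", "forget")
# "" denotes the parent directory that chains train and infer.
STAGES = ("", "_train", "_infer", "_grade", "_val_target")


def experiment_dir(phase: str, model: str, stage: str = "",
                   epoch_offset: Optional[int] = None) -> str:
    """Build an experiments/ directory name from its parts (naming sketch only)."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage suffix: {stage}")
    if (phase == "forget") != (epoch_offset is not None):
        raise ValueError("the _epoch<N> suffix applies to forget recipes only")
    name = f"{phase}_{model}"
    if epoch_offset is not None:
        name += f"_epoch{epoch_offset}"
    return f"experiments/{name}{stage}"


learn_loss_dir = experiment_dir("learn", "gemma3_4b", "_val_target")
forget_grade_dir = experiment_dir("forget", "gemma3_4b", "_grade", epoch_offset=0)
```

Appending /all_jobs.sh to a returned path gives the launch script for that stage.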
Trains the base model from scratch on the synthetic dataset. This is the first phase of the pipeline; update and forget experiments resume from its checkpoints.
bash experiments/learn_gemma3_4b/all_jobs.sh             # train + infer
bash experiments/learn_gemma3_4b_grade/all_jobs.sh       # grade
bash experiments/learn_gemma3_4b_val_target/all_jobs.sh  # compute loss

Continues training from a fixed learn-phase checkpoint on updated datapoints.

bash experiments/update_gemma3_4b/all_jobs.sh            # train + infer
bash experiments/update_gemma3_4b_grade/all_jobs.sh      # grade

Continues training from a learn-phase checkpoint on an unrelated dataset.

bash experiments/forget_gemma3_4b_epoch0/all_jobs.sh        # train + infer
bash experiments/forget_gemma3_4b_epoch0_grade/all_jobs.sh  # grade

To author new experiments for any of these phases, see Running new experiments and sweeps below.

# Multi-GPU with accelerate (set --num_processes to the number of GPUs)
accelerate launch --config_file accelerate_configs/train_launch.yaml \
--num_processes 4 src/origins/main.py --config-name train
# Any Hydra override can be appended
accelerate launch --config_file accelerate_configs/train_launch.yaml \
--num_processes 4 src/origins/main.py --config-name train \
train.per_device_train_batch_size=2

When you tweak gradient_accumulation_steps, two values must agree at runtime:
- cfg.train.gradient_accumulation_steps: the Hydra config value consumed inside main.py, and
- accelerator.gradient_accumulation_steps: the value Accelerate hands to the live Accelerator, read from the launcher's Accelerate / DeepSpeed config file (e.g. accelerate_configs/train_launch.yaml) and optionally overridden by the --gradient_accumulation_steps CLI flag.

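A pre-launch sanity check for this agreement might look like the sketch below. It works on plain integers rather than a live Accelerator, both function names are hypothetical, and the effective batch size is the usual data-parallel product of per-device batch size, process count, and accumulation steps.

```python
def check_accumulation_agreement(hydra_steps: int, accelerate_steps: int) -> None:
    """Raise if the Hydra and Accelerate accumulation settings disagree."""
    if hydra_steps != accelerate_steps:
        raise ValueError(
            f"gradient_accumulation_steps mismatch: Hydra config has {hydra_steps}, "
            f"Accelerate has {accelerate_steps}"
        )


def effective_batch_size(per_device_batch_size: int, num_processes: int,
                         gradient_accumulation_steps: int) -> int:
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch_size * num_processes * gradient_accumulation_steps


# Values taken from the training overrides and the 4-GPU launch command in this README.
check_accumulation_agreement(hydra_steps=1, accelerate_steps=1)
batch = effective_batch_size(per_device_batch_size=16, num_processes=4,
                             gradient_accumulation_steps=1)
```

If the two settings diverge, the effective batch size you compute from the Hydra config no longer describes what the optimizer actually sees.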
For experiments with grids of hyperparameters, define a single sweep YAML under src/origins/configs/experiments/ and use src/origins/prepare_sweeps/prepare_experiments.py to expand it into stage-specific configs and launch scripts.
A sweep file declares the grid under hydra.sweeper.params. For example:
# @package _global_
defaults:
  - /defaults/train_small
  - _self_
model:
  name: google/gemma-3-4b-it
  attn_implementation: "eager"
train:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 1
hydra:
  sweeper:
    params:
      train.learning_rate: 1e-7, 1e-6, 1e-5

Then expand it:
python3 src/origins/prepare_sweeps/prepare_experiments.py \
--sweep_file_path <path_to_sweep_config_file>.yaml \
--train_num_gpus 4 \
--infer_num_gpus 4 \
--train_accelerate_config train_launch

This generates four folders under experiments/: one each for the train, infer, and grade stages, and one that pairs train and infer.
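Conceptually, expanding a sweep means taking the Cartesian product of the comma-separated values under hydra.sweeper.params. The sketch below shows that idea only; it is not the code in prepare_experiments.py. The `run_name` format is an assumption inferred from run names such as learning_rate1e-05 and is only meant for single numeric parameters.

```python
import itertools


def expand_grid(params: dict[str, str]) -> list[dict[str, str]]:
    """Expand comma-separated sweep values into one dict per grid point."""
    keys = list(params)
    choices = [[value.strip() for value in params[key].split(",")] for key in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*choices)]


def run_name(key: str, value: str) -> str:
    """Assumed naming: last key segment followed by the value as a Python float."""
    return f"{key.split('.')[-1]}{float(value)}"


grid = expand_grid({"train.learning_rate": "1e-7, 1e-6, 1e-5"})
names = [run_name(key, value) for point in grid for key, value in point.items()]
```

Adding a second swept parameter multiplies the number of grid points, so a three-value and a two-value parameter together produce six runs.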
Compact "update" and "forget" configs inherit from a learn sweep via
_learn_config, _learn_run_name, and _learn_num_epochs:
# @package _global_
_learn_config: /experiments/learn_gemma3_4b
_learn_run_name: learning_rate1e-05   # the learn run to continue from
_learn_num_epochs: 9                  # epoch checkpoint to load
train:
  upd:
    num_update_epochs: 20
  learning_rate: 4e-6

The forget recipe is the same shape, plus a train.forget block and the dataset to continue on. The _epoch<N> suffix in the filename indicates that training resumes from model_after_epoch_<saturation_epoch + N>:
# @package _global_
_learn_config: /experiments/learn_gemma3_4b
_learn_run_name: learning_rate1e-05
_learn_num_epochs: 9                  # = saturation epoch + N
train:
  learning_rate: 2e-6
  lr_scheduler_type: "cosine"
  forget:
    num_forget_epochs: 50
    num_forget_samples: 5000

Running src/origins/prepare_sweeps/prepare_experiments.py on an update / forget recipe writes the same folders under experiments/ as the learn case (e.g. experiments/update_gemma3_4b{_train,_infer,_grade,}/ for the update recipe above). It resolves the inherited learn config, points the training at the right checkpoint, forces the appropriate phase flags, and ties the matching infer + grade configs to those outputs.
See src/origins/configs/experiments/EXAMPLE_{learn,update,forget}_gemma3_4b.yaml for fully worked sweep templates, and src/origins/prepare_sweeps/prepare_experiments.sh for the exact prepare_experiments.py invocations used to generate every sweep in experiments/ (including the --train_accelerate_config train_launch_big variants for the larger models).
@article{davidson2026futurefacts,
title={The Future of Facts: Tracing the Factual Generation-Verification Gap},
author={Tim R. Davidson and Anja Surina and Caglar Gulcehre},
journal={arXiv preprint arXiv:2605.27564},
year={2026},
url={https://arxiv.org/abs/2605.27564},
}

This project is released under the terms of the MIT License.
The initial scaffolding was generated from the Python Machine Learning Research Template.
