LOG-postech/self-fv-icl

Quantifying Aleatoric Uncertainty of In-Context Learning via Self-Function Vectors

Official code for the ACL 2026 paper "Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence" by Jinseok Chung, Minkyoung Song*, Hyunji Jung*, and Namhoon Lee (POSTECH). *Equal contribution.

📄 Paper: arXiv:2606.19353

Setup

conda env create -f environment.yml
conda activate fv

Data

python -c "import nltk; nltk.download('wordnet')"   # once
python prepare_data.py                              # build all datasets
# or a subset:
python prepare_data.py --only wordnet moons ag_news emotion gsm8k hellaswag

WordNetMCQ and moons are constructed locally; AG News / Emotion / GSM8K / HellaSwag are downloaded from the Hugging Face Hub and reformatted. Sources and licenses are listed in dataset_files/README.md.

Reproducing the experiments

The pipeline has two stages: select causal heads once, then run the self-function-vector experiments on top of them.

# 1) Causal head selection: CIE analysis -> top_heads.json + mean_head_act.pt
bash scripts/01_compute_cie.sh

# 2) Self-function-vector uncertainty decomposition (also runs the
#    Total-entropy and function-vector baselines)
bash scripts/02_run_self_fv.sh
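
For orientation, the total-entropy baseline named above can be illustrated with a minimal sketch. It assumes the baseline is the Shannon entropy of the model's predictive distribution over candidate answers; the function name `total_entropy` and the toy probabilities are hypothetical, and the repository's own implementation (log base, normalization, token-level details) may differ.

```python
import math


def total_entropy(probs):
    """Shannon entropy (in nats) of a distribution over candidate answers.

    Hypothetical helper for illustration only, not the repository's code.
    Inputs are renormalized so unnormalized scores are also accepted.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    normalized = [p / total for p in probs]
    # Terms with p == 0 contribute nothing (limit of p * log p as p -> 0).
    return -sum(p * math.log(p) for p in normalized if p > 0)


# Toy distributions over four answer options (made-up numbers).
uniform = total_entropy([0.25, 0.25, 0.25, 0.25])    # maximally uncertain
confident = total_entropy([0.97, 0.01, 0.01, 0.01])  # sharply peaked
print(f"uniform={uniform:.4f} nats, confident={confident:.4f} nats")
```

A uniform distribution over four options gives the maximum value, ln 4, while a peaked distribution gives a value close to zero.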

Each script exposes MODEL, DATA, ANSWER_TYPE, NUM_SHOTS, etc. as environment variables, e.g.:

MODEL=meta-llama/Llama-2-13b-hf DATA=ag_news ANSWER_TYPE=generation NUM_LAYERS=40 \
  bash scripts/01_compute_cie.sh

Models in the paper: LLaMA2-7B/13B/70B, Qwen2.5-7B, Mistral-7B. Set NUM_LAYERS to the model's depth (32 / 40 / 80 for LLaMA2-7B / 13B / 70B); the intervention layer defaults to roughly one third of the depth.
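
As a rough illustration of the "~1/3 of the depth" default, the sketch below maps a layer count to an intervention layer. The helper name `default_intervention_layer` and the use of floor division are assumptions; the scripts may round differently or use a different indexing convention.

```python
def default_intervention_layer(num_layers: int) -> int:
    """Return roughly one third of the model depth.

    Hypothetical helper: floor division is an assumption, chosen only to
    illustrate the "~1/3 of the depth" rule stated in this README.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    return num_layers // 3


# Depths listed in this README for the LLaMA2 family.
for depth in (32, 40, 80):
    print(f"NUM_LAYERS={depth} -> intervention layer ~{default_intervention_layer(depth)}")
```

For Hugging Face checkpoints, the depth is typically available as the `num_hidden_layers` field of the model config, which can be used to set NUM_LAYERS for models not listed above.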

Repository layout

main_icl/                 Core paper code
  main.py                 Main entry point: CIE head scoring (--cie 1) and the self-FV
                          experiment (--use_self_fv 1) with baselines (FV / total entropy)
  instruction.py          Prompt templates / instructions
  utils/                  prompt, data, model, inference, intervention, metrics
select_top_heads.py       Aggregate per-layer CIE tensors -> top_heads.json (head selection)
prepare_data.py           Reconstruct all task datasets into dataset_files/icl/
generate_ood_wordnet.py   OOD-query dataset generation (used by prepare_data.py)
dataset_files/            Dataset documentation; icl/ is generated, not committed
scripts/                  Reproduction entry points
environment.yml           Conda environment
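
The head-selection step performed by select_top_heads.py can be pictured with the following sketch. It assumes CIE scores are available as a layers-by-heads grid and that selection means keeping the k highest-scoring (layer, head) pairs; the function name, the toy scores, and the JSON record layout are hypothetical and are not the actual format of top_heads.json.

```python
import heapq
import json


def select_top_heads(cie_scores, k):
    """Return the k highest-scoring heads from a [layer][head] score grid.

    Hypothetical illustration; the real script aggregates per-layer CIE
    tensors and may rank, filter, or serialize them differently.
    """
    flat = [
        (score, layer, head)
        for layer, row in enumerate(cie_scores)
        for head, score in enumerate(row)
    ]
    # nlargest returns results sorted from highest to lowest score.
    top = heapq.nlargest(k, flat, key=lambda item: item[0])
    return [{"layer": l, "head": h, "score": s} for s, l, h in top]


# Toy CIE scores for a 2-layer, 3-head model (made-up numbers).
scores = [
    [0.01, 0.30, 0.02],
    [0.25, 0.00, 0.05],
]
top = select_top_heads(scores, k=2)
print(json.dumps(top, indent=2))
```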

Task data (dataset_files/icl/) and experiment outputs (results*/, wandb/, *.pt, figures) are git-ignored and regenerated by prepare_data.py / the pipeline.

Citation

@inproceedings{chung2026quantifying,
  title     = {Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence},
  author    = {Chung, Jinseok and Song, Minkyoung and Jung, Hyunji and Lee, Namhoon},
  booktitle = {Proceedings of the 2026 Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2026},
  eprint    = {2606.19353},
  archivePrefix = {arXiv},
  url       = {https://arxiv.org/abs/2606.19353}
}

Acknowledgments

We are grateful to the authors of the following repositories, which we referred to while developing this work:
