Official code for the ACL 2026 paper "Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence" by Jinseok Chung, Minkyoung Song*, Hyunji Jung*, and Namhoon Lee (POSTECH). *Equal contribution.
📄 Paper: arXiv:2606.19353
conda env create -f environment.yml
conda activate fv
python -c "import nltk; nltk.download('wordnet')" # once
python prepare_data.py # build all datasets
# or a subset:
python prepare_data.py --only wordnet moons ag_news emotion gsm8k hellaswag
WordNetMCQ and moons are constructed locally; AG News / Emotion / GSM8K / HellaSwag are
downloaded from the Hugging Face Hub and reformatted. Sources and licenses are listed in
dataset_files/README.md.
The pipeline has two stages: select causal heads once, then run the self-function-vector experiments on top of them.
# 1) Causal head selection: CIE analysis -> top_heads.json + mean_head_act.pt
bash scripts/01_compute_cie.sh
# 2) Self-function-vector uncertainty decomposition (also runs the
# Total-entropy and function-vector baselines)
bash scripts/02_run_self_fv.sh
Each script exposes MODEL, DATA, ANSWER_TYPE, NUM_SHOTS, etc. as environment
variables, e.g.:
MODEL=meta-llama/Llama-2-13b-hf DATA=ag_news ANSWER_TYPE=generation NUM_LAYERS=40 \
bash scripts/01_compute_cie.sh
Models in the paper: LLaMA2-7B/13B/70B, Qwen2.5-7B, Mistral-7B. Set NUM_LAYERS to the
model's depth (32 / 40 / 80 for LLaMA2-7B / 13B / 70B); the intervention layer defaults
to ~1/3 of the depth.
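As a rough illustration of the "~1/3 of the depth" default, the sketch below derives an intervention layer from NUM_LAYERS. The helper name `default_intervention_layer` and the use of floor division are assumptions made for this example; the repository may round differently or index layers from a different base.

```python
def default_intervention_layer(num_layers: int) -> int:
    """Return a layer index roughly one third of the way into the model.

    Hypothetical helper: floor division is an assumption, not necessarily
    the rounding rule used by the scripts in this repository.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be a positive integer")
    return num_layers // 3


if __name__ == "__main__":
    # Depths quoted above for the LLaMA2 family.
    for depth in (32, 40, 80):
        print(f"NUM_LAYERS={depth} -> intervention layer {default_intervention_layer(depth)}")
```

Under this assumption the three depths map to layers 10, 13 and 26; check the scripts for the value actually used before comparing against reported results.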
main_icl/                Core paper code
main.py                  Main entry point: CIE head scoring (--cie 1) and the self-FV
                         experiment (--use_self_fv 1) with baselines (FV / total entropy)
instruction.py           Prompt templates / instructions
utils/                   prompt, data, model, inference, intervention, metrics
select_top_heads.py      Aggregate per-layer CIE tensors -> top_heads.json (head selection)
prepare_data.py          Reconstruct all task datasets into dataset_files/icl/
generate_ood_wordnet.py  OOD-query dataset generation (used by prepare_data.py)
dataset_files/           Dataset documentation; icl/ is generated, not committed
scripts/                 Reproduction entry points
environment.yml          Conda environment
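The head-selection step (per-layer CIE scores aggregated into top_heads.json) can be pictured with the minimal sketch below. It is not the code in select_top_heads.py: the function name, the nested-list input, and the JSON field names (`layer`, `head`, `cie`) are all assumptions chosen for illustration, and the scores are made-up toy values.

```python
import json


def select_top_heads(cie_scores, k):
    """Rank attention heads by CIE score and keep the k highest.

    cie_scores[layer][head] holds one score per head (hypothetical layout).
    Returns a list of (layer, head, score) tuples, highest score first.
    """
    flat = [
        (layer, head, float(score))
        for layer, row in enumerate(cie_scores)
        for head, score in enumerate(row)
    ]
    # Sort by score, descending; ties keep their (layer, head) order.
    flat.sort(key=lambda item: item[2], reverse=True)
    return flat[:k]


# Toy scores for a 2-layer, 3-head model (illustrative values only).
toy_scores = [
    [0.01, 0.30, -0.05],
    [0.12, 0.02, 0.25],
]
top = select_top_heads(toy_scores, k=2)

# Serialize in an assumed format; the real top_heads.json may differ.
print(json.dumps([{"layer": l, "head": h, "cie": s} for l, h, s in top], indent=2))
```

Ranking over the flattened (layer, head) grid keeps the selection global, so a strong head in a late layer can outrank every head in an early one.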
Task data (dataset_files/icl/) and experiment outputs (results*/, wandb/, *.pt, figures) are git-ignored and regenerated by prepare_data.py / the pipeline.
@inproceedings{chung2026quantifying,
title = {Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence},
author = {Chung, Jinseok and Song, Minkyoung and Jung, Hyunji and Lee, Namhoon},
booktitle = {Proceedings of the 2026 Annual Meeting of the Association for Computational Linguistics (ACL)},
year = {2026},
eprint = {2606.19353},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2606.19353}
}
We are grateful to the authors of the following repositories, which we referred to while developing this work: