Calibration Toolbox is a Python library for evaluating machine learning model calibration using binning-based metrics. The package provides a comprehensive collection of calibration metrics and visualization tools for research in deep learning and uncertainty quantification.
- Comprehensive Metrics: ECE, MCE, RMSCE, ACE, SCE, and more
- General Calibration Error Framework: Flexible GCE function for custom metrics
- Framework-Agnostic: Works with NumPy arrays - no PyTorch or TensorFlow required
- Visualization Tools: Reliability diagrams, confidence histograms, and class-wise calibration curves
- Well-Tested: Extensive test coverage with edge case handling
- Research-Oriented: Implementations based on recent research papers
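# Install the released package
pip install calibration-toolbox
# Or install from source
git clone https://github.com/Jonathan-Pearce/calibration-toolbox.git
cd calibration-toolbox
pip install -e .
# Development install (adds the "dev" extras)
pip install -e ".[dev]"

import numpy as np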
from calibration_toolbox import expected_calibration_error, reliability_diagram
# Your model's predicted probabilities
probabilities = np.array([[0.8, 0.2], [0.6, 0.4], [0.9, 0.1]])
labels = np.array([0, 1, 0])
# Compute Expected Calibration Error
ece = expected_calibration_error(probabilities, labels)
print(f"ECE: {ece:.4f}")
# Visualize calibration
reliability_diagram(probabilities, labels)

The GCE is a flexible framework that can compute various calibration metrics through parameter configuration:
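A general form consistent with the symbols defined below can be written as follows. This is a sketch under assumptions rather than the library's confirmed definition: the $1/K$ class average, the $n_{bk}/N$ bin weights, and the symbol $B$ for the number of bins are all assumed here, and for $p = \infty$ the expression is read as the maximum gap over bins and classes.

$$
\mathrm{GCE} = \left( \frac{1}{K} \sum_{k=1}^{K} \sum_{b=1}^{B} \frac{n_{bk}}{N} \, \bigl| acc(b,k) - conf(b,k) \bigr|^{p} \right)^{1/p}
$$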
Where:
- $acc(b,k)$ and $conf(b,k)$ are the accuracy and confidence of bin $b$ for class $k$
- $n_{bk}$ is the number of predictions in bin $b$ for class $k$
- $N$ is the total number of data points
- $K$ is the number of classes
- $p$ is the norm parameter (1, 2, or ∞)
Usage:
from calibration_toolbox import general_calibration_error
gce = general_calibration_error(
probabilities, labels,
n_bins=15,
class_conditional=True,
adaptive_bins=False,
top_k_classes='all',
norm=1,
thresholding=0.0
)

Reference: Naeini et al. (2015). "Obtaining Well Calibrated Probabilities Using Bayesian Binning." AAAI.
ECE measures the difference between model confidence and accuracy across uniformly-spaced bins:
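As a rough illustration of that binning, the sketch below computes a top-label ECE in plain NumPy. The function name `ece_sketch`, the right-closed bin edges, and the choice to skip empty bins are assumptions made for illustration; the library's `ECE` may differ in these details.

```python
import numpy as np


def ece_sketch(probabilities, labels, n_bins=15):
    """Top-label ECE with equal-width bins (illustrative, not the library code)."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)

    confidences = probabilities.max(axis=1)       # confidence of the predicted class
    predictions = probabilities.argmax(axis=1)    # index of the predicted class
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so bin ids run from 0 to n_bins - 1
    # and a confidence of exactly 1.0 falls into the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap            # weight by the share of samples in the bin
    return ece
```

Usage: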
from calibration_toolbox import ECE
ece = ECE(probabilities, labels, n_bins=15)

Reference: Naeini et al. (2015). "Obtaining Well Calibrated Probabilities Using Bayesian Binning." AAAI.
MCE is the maximum calibration error across all bins:
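In symbols, writing $acc(b)$ and $conf(b)$ for the accuracy and mean confidence of bin $b$ (per-bin notation assumed here, mirroring the GCE symbols) and taking the maximum over non-empty bins:

$$
\mathrm{MCE} = \max_{b} \, \bigl| acc(b) - conf(b) \bigr|
$$

Usage: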
from calibration_toolbox import MCE
mce = MCE(probabilities, labels, n_bins=15)

Reference: Hendrycks et al. (2019). "Deep Anomaly Detection with Outlier Exposure." ICLR.
RMSCE is the root mean square of calibration errors:
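A common way to write this uses $n_b$ for the number of samples in bin $b$, $N$ for the total number of samples, and $acc(b)$, $conf(b)$ for the bin's accuracy and mean confidence. Weighting each squared gap by the bin's share of the data is an assumption here; the library may weight bins differently.

$$
\mathrm{RMSCE} = \sqrt{ \sum_{b} \frac{n_b}{N} \, \bigl( acc(b) - conf(b) \bigr)^{2} }
$$

Usage: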
from calibration_toolbox import RMSCE
rmsce = RMSCE(probabilities, labels, n_bins=15)

Reference: Nixon et al. (2020). "Measuring Calibration in Deep Learning." CVPR Workshops.
SCE is the class-conditional calibration error with uniform binning:
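To make "class-conditional" concrete, the sketch below bins the predicted probability of every class separately, instead of only the top prediction, and averages the weighted gaps over classes. The name `sce_sketch`, the right-closed bins, and the $1/K$ averaging are assumptions for illustration, not the library's confirmed behaviour.

```python
import numpy as np


def sce_sketch(probabilities, labels, n_bins=15):
    """Class-conditional calibration error with equal-width bins (illustrative only)."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    n_samples, n_classes = probabilities.shape
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]

    total = 0.0
    for k in range(n_classes):
        class_probs = probabilities[:, k]         # predicted probability of class k
        is_class = (labels == k).astype(float)    # 1.0 where the true label is k
        bin_ids = np.digitize(class_probs, interior_edges, right=True)
        for b in range(n_bins):
            in_bin = bin_ids == b
            if in_bin.any():
                gap = abs(is_class[in_bin].mean() - class_probs[in_bin].mean())
                total += in_bin.sum() / n_samples * gap
    return total / n_classes                      # average over classes
```

Usage: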
from calibration_toolbox import SCE
sce = SCE(probabilities, labels, n_bins=15)

Reference: Nixon et al. (2020). "Measuring Calibration in Deep Learning." CVPR Workshops.
ACE uses adaptive binning (equal number of samples per bin):
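The binning step alone can be sketched as follows: sort the samples by score and cut the sorted order into equally sized chunks. The helper names `equal_mass_bins` and `mean_bin_gap` are hypothetical, and the unweighted mean over bins is an assumption; the library's `ACE` may also apply this per class.

```python
import numpy as np


def equal_mass_bins(scores, n_bins):
    """Return index arrays that split the samples into bins of (nearly) equal size."""
    order = np.argsort(scores, kind="stable")     # sample indices, lowest score first
    return np.array_split(order, n_bins)          # bin sizes differ by at most one


def mean_bin_gap(scores, outcomes, n_bins):
    """Average |accuracy - confidence| over non-empty equal-mass bins."""
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    gaps = [
        abs(outcomes[idx].mean() - scores[idx].mean())
        for idx in equal_mass_bins(scores, n_bins)
        if idx.size > 0
    ]
    return float(np.mean(gaps))
```

Unlike equal-width bins, no bin is left nearly empty when most confidences cluster near 1.0.

Usage: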
from calibration_toolbox import ACE
ace = ACE(probabilities, labels, n_bins=15)

Reference: Gupta et al. (2021). "Calibration of Neural Networks using Splines." ICLR.
Top-k calibration error measures calibration over the top-k predicted classes:
from calibration_toolbox import top_k_calibration_error
top2_ce = top_k_calibration_error(probabilities, labels, k=2)

Reference: Thulasidasan et al. (2019). "On Mixup Training." NeurIPS.
OE measures the degree of overconfidence:
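One definition from the mixup calibration literature penalises only bins where confidence exceeds accuracy and scales the penalty by the bin's confidence. That the library follows exactly this form is an assumption; $n_b$, $N$, $acc(b)$ and $conf(b)$ denote the bin count, total sample count, bin accuracy and mean bin confidence.

$$
\mathrm{OE} = \sum_{b} \frac{n_b}{N} \, conf(b) \cdot \max\bigl( conf(b) - acc(b),\, 0 \bigr)
$$

Usage: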
from calibration_toolbox import overconfidence_error
oe = overconfidence_error(probabilities, labels)

A reliability diagram shows the relationship between predicted confidence and actual accuracy:
from calibration_toolbox import reliability_diagram
reliability_diagram(probabilities, labels, n_bins=15)

A confidence histogram shows the distribution of model confidences:
from calibration_toolbox import confidence_histogram
confidence_histogram(probabilities, labels, n_bins=15)

Class-wise calibration curves show the calibration of each class separately:
from calibration_toolbox import class_wise_calibration_curve
class_wise_calibration_curve(probabilities, labels)

The calibration error decomposition compares multiple calibration metrics in one plot:
from calibration_toolbox import calibration_error_decomposition
calibration_error_decomposition(probabilities, labels)

If your model outputs logits instead of probabilities, set logits=True:
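For intuition, the conversion this flag implies is a row-wise softmax. The sketch below is a standard, numerically stable version written for illustration; the function name `softmax` and the max-shift are assumptions about what happens internally, not the library's code.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)   # largest entry per row becomes 0
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)             # each row sums to 1
```

With the library, the flag is passed directly: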
import numpy as np
from calibration_toolbox import ECE
# Model outputs logits
logits = np.array([[2.0, -1.0], [1.0, 0.5], [-0.5, 1.5]])
labels = np.array([0, 0, 1])
# Compute ECE (will apply softmax internally)
ece = ECE(logits, labels, logits=True)

Check out the examples directory for Jupyter notebooks demonstrating:
- Basic usage and metric computation
- Visualization creation
- Comparing models with different calibration qualities
- Advanced usage of the GCE framework
Full documentation is available at:
- GitHub Pages: https://jonathan-pearce.github.io/calibration-toolbox/ (once enabled)
- Read the Docs: https://calibration-toolbox.readthedocs.io/ (once published)
To build and view the documentation website locally:
# Option 1: Use the build script
bash build_docs.sh
# Option 2: Manual build
pip install sphinx sphinx-rtd-theme nbsphinx
# Build from the repository root so the docs/_build/html path used below resolves
make -C docs html
# View the website
python -m http.server 8000 --directory docs/_build/html
# Then open http://localhost:8000 in your browser

Run the test suite:
python -m pytest tests/ -v

Or use the test script:
bash run_tests.sh

With coverage (requires pytest-cov):
pip install pytest-cov
pytest tests/ -v --cov=calibration_toolbox --cov-report=html

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
If you use Calibration Toolbox in your research, please cite:
@software{calibration_toolbox,
author = {Pearce, Jonathan},
title = {Calibration Toolbox: A Python Library for Model Calibration Evaluation},
year = {2026},
url = {https://github.com/Jonathan-Pearce/calibration-toolbox}
}

This package is inspired by and follows the structure of:
- Uncertainty Toolbox - Comprehensive uncertainty quantification library
Key papers that informed this package's design:
- Naeini et al. (2015): "Obtaining Well Calibrated Probabilities Using Bayesian Binning." AAAI.
- Guo et al. (2017): "On Calibration of Modern Neural Networks." ICML.
- Kull et al. (2019): "Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration." NeurIPS.
- Nixon et al. (2020): "Measuring Calibration in Deep Learning." CVPR Workshops.
- Gupta et al. (2021): "Calibration of Neural Networks using Splines." ICLR.
See papers.md for a complete list of references.
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:
- The calibration research community for developing these metrics
- The Uncertainty Toolbox team for inspiration on package structure
- Contributors and users of this library
Maintained by: Jonathan Pearce
Repository: https://github.com/Jonathan-Pearce/calibration-toolbox