OverlapIndex (OI)
This package provides an implementation of the Overlap Index (OI), a cluster-validity measure designed to quantify the degree of overlap between data classes or clusters. The OI can be updated online with ARTMAP-based backends, or computed in batch with offline clustering backends, making it useful for streaming, continual learning, large-scale representation analysis, and embedding-space diagnostics.
The implementation supports multiple swappable clustering backends:
- Fuzzy ARTMAP and Hypersphere ARTMAP for incremental / online updates.
- KMeans and MiniBatchKMeans for offline centroid-based analysis.
- BallCover for offline greedy landmark-ball covers, useful when the goal is to preserve class-support geometry for downstream shape or topology analysis.
To install OverlapIndex, simply use pip:

```bash
pip install overlapindex
```

That installs the default batch-oriented dependencies. To enable the incremental ART backends as well, install the optional ART extra:

```bash
pip install "overlapindex[art]"
```

The core package and optional art extra support Python 3.9 through 3.14.

Or to install directly from the most recent source:

```bash
pip install git+https://github.com/NiklasMelton/OverlapIndex.git@develop
```

The Overlap Index is bounded in the interval [0, 1] and has the following interpretation:
- OI = 1.0: Indicates perfect class separation (no overlap).
- OI = 0.5: Indicates complete overlap between classes.
- OI < 0.5: Indicates a degenerate or pathological case in the data distribution.

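As a quick illustration of the interpretation bands, the helper below maps a score to a short description. It is a hypothetical convenience function, not part of the package; the tolerance `tol` and the "partial overlap" label for values strictly between 0.5 and 1.0 are assumptions made for this sketch.

```python
import math


def describe_oi(score: float, tol: float = 1e-9) -> str:
    """Map an Overlap Index value to one of the interpretation bands.

    Hypothetical helper for illustration; `tol` is an assumed tolerance.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("OI is bounded in [0, 1]")
    if math.isclose(score, 1.0, abs_tol=tol):
        return "perfect separation"
    if math.isclose(score, 0.5, abs_tol=tol):
        return "complete overlap"
    if score < 0.5:
        return "degenerate or pathological"
    return "partial overlap"  # assumed label for 0.5 < OI < 1.0
```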
The index is computed incrementally by tracking shared cluster activations between pairs of classes and aggregating class-wise overlap into a global measure.
- Incremental and Offline Modes: ARTMAP backends support streaming updates via add_sample and mini-batch updates via add_batch. Offline backends such as KMeans, MiniBatchKMeans, and BallCover support batch computation through add_batch.
- Label-Aware: Can be applied both to labeled raw data and to intermediate representations (e.g., neural network activations).
- Geometry-Agnostic: Works well on arbitrary geometric structures of data. No geometric constraints are assumed.

The Overlap Index can be used in several settings:
- Unsupervised clustering evaluation: As an incremental cluster validity index (iCVI), OI provides insight into the quality of a clustering partition as it evolves over time.
- Class separability analysis: Measures the degree of overlap in labeled datasets without requiring a classifier.
- Representation monitoring in deep learning: Tracks how class separation changes across layers or training epochs.
- Backbone evaluation for transfer learning: Compares feature extractors, where higher OI values indicate better class separation in the backbone embeddings.

- ART-based clustering is performed using artlib's FuzzyARTMAP or HypersphereARTMAP.
- artlib is an optional dependency and is only required when using the "Fuzzy" or "Hypersphere" backends.
- Offline centroid backends fit one clustering model per class and concatenate the resulting class-owned prototypes into global cluster ids.
- The BallCover backend fits one greedy ball cover per class and treats ball centers as class-owned prototypes.
- Normalize input features before fitting. Examples in this repository use MinMaxScaler for convenience.
- ART backends complement-code inputs internally and therefore require features in the [0, 1] interval.
- Offline backends (KMeans, MiniBatchKMeans, and BallCover) consume normalized features directly and do not apply complement coding.
- Overlap is estimated by monitoring shared best-matching units (BMUs) or top prototype activations between class pairs.
- The global OI is computed as the macro mean of per-class minimum pairwise overlap scores, so each observed class contributes equally to index.
- A support-weighted companion score is available through weighted_index for workflows that need the score to reflect observed class frequencies.
- Global aggregation can exclude one or more label ids through exclude_classes without removing those labels from fitting, singleton scores, or pairwise scores.

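The aggregation described in the last three notes can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: `aggregate_overlap`, the `pairwise` dictionary keyed by ordered class pairs, and the `counts` dictionary are all hypothetical stand-ins, and the scores are made-up numbers.

```python
from statistics import mean


def aggregate_overlap(pairwise, counts, exclude_classes=()):
    """Aggregate pairwise scores into (singleton, macro index, weighted index).

    pairwise: dict mapping ordered class pairs (y, b) to a score in [0, 1].
    counts:   dict mapping each class to its number of samples.
    Both inputs are hypothetical stand-ins used only for illustration.
    """
    excluded = set(exclude_classes)
    # Per-class score: the minimum pairwise score involving that class.
    # Excluded classes still take part in this step.
    singleton = {}
    for (y, _b), score in pairwise.items():
        singleton[y] = min(score, singleton.get(y, 1.0))
    # Only the two global summaries drop the excluded classes.
    kept = [y for y in singleton if y not in excluded]
    macro = mean(singleton[y] for y in kept)
    total = sum(counts[y] for y in kept)
    weighted = sum(counts[y] * singleton[y] for y in kept) / total
    return singleton, macro, weighted


# Made-up pairwise scores for three classes.
pairwise = {
    ("a", "b"): 0.9, ("a", "c"): 0.7,
    ("b", "a"): 0.8, ("b", "c"): 1.0,
    ("c", "a"): 0.6, ("c", "b"): 1.0,
}
counts = {"a": 10, "b": 30, "c": 60}
singleton, macro, weighted = aggregate_overlap(pairwise, counts)
```

With these numbers the macro mean treats all three classes equally, while the weighted score is pulled toward class "c" because it has the most samples. Passing `exclude_classes=["c"]` leaves `singleton["c"]` in place but removes it from both summaries.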
```python
from sklearn.preprocessing import MinMaxScaler
from overlapindex import OverlapIndex

# X is the feature matrix and y the label vector (assumed to be loaded already).
# Normalize features before fitting.
X = MinMaxScaler().fit_transform(X)

# MiniBatchKMeans is the default backend and is recommended for most offline use cases.
oi = OverlapIndex(
    kmeans_k=10,
    kmeans_kwargs={"random_state": 0},
)

# sklearn-style API
oi.fit(X, y)
score = oi.index
```

The fitted value is available through oi.index. For users who prefer update methods that return the current score directly, add_batch(X, y) is also supported.

exclude_classes lets you keep a label fully involved in overlap evaluation while omitting it from the two global summary scores:

```python
# Exclude a single label id from the global summaries.
oi = OverlapIndex(exclude_classes=0)
# Exclude several label ids.
oi = OverlapIndex(exclude_classes=[0, "unlabeled"])
```

This is useful for segmentation workflows where only foreground objects are labeled but background-only samples should still contribute to pairwise overlap counts. A common pattern is to create one background class containing those samples, then pass that class id to exclude_classes. The background class will still appear in singleton_index, pairwise_index, and prototype ownership; only index and weighted_index omit it from aggregation.

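The background-class pattern can be prepared with a small label-mapping step before fitting. The sketch below assumes unlabeled samples are marked with `None` and that id `0` is free to use for the background; `with_background_class` and `BACKGROUND_ID` are hypothetical names introduced only for this example.

```python
BACKGROUND_ID = 0  # assumed free label id reserved for background samples


def with_background_class(labels, background=BACKGROUND_ID):
    """Replace missing labels (None) with one shared background class id."""
    return [background if label is None else label for label in labels]


# Foreground classes are 1 and 2; None marks background-only samples.
y_raw = [1, None, 2, None, 1]
y = with_background_class(y_raw)
# y can then be passed with X to fit, on an instance constructed with
# exclude_classes=BACKGROUND_ID so the background is left out of the summaries.
```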
```python
from overlapindex import OverlapIndex

# For ARTMAP backends, batches should already be scaled into [0, 1].
oi = OverlapIndex(
    model_type="Hypersphere",
    rho=0.9,
    match_tracking="MT+",
)

# stream is any iterable yielding (X_batch, y_batch) pairs.
for X_batch, y_batch in stream:
    oi.partial_fit(X_batch, y_batch)
    score = oi.index  # current score after this batch
```

For single-sample streams, ARTMAP backends also support add_sample(x, y), which updates the model and returns the current score directly. Labeled mini-batches can also be passed to add_batch(X, y).

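Because ARTMAP backends expect inputs in [0, 1], a stream usually needs scaling bounds fixed up front rather than re-estimated per batch. The sketch below is illustrative only: `scale_to_unit` and `complement_code` are hypothetical helpers, the feature bounds are assumed to be known in advance, and `complement_code` shows the standard transform x -> [x, 1 - x] rather than the package's internal code.

```python
def scale_to_unit(x, lo, hi):
    """Min-max scale one sample with fixed bounds, clipping into [0, 1]."""
    scaled = []
    for value, low, high in zip(x, lo, hi):
        span = high - low
        unit = (value - low) / span if span > 0 else 0.0
        scaled.append(min(1.0, max(0.0, unit)))
    return scaled


def complement_code(x):
    """Standard complement coding: append 1 - x_i for every feature."""
    return list(x) + [1.0 - value for value in x]


lo, hi = [0.0, 10.0], [4.0, 20.0]            # assumed, known feature bounds
sample = scale_to_unit([1.0, 25.0], lo, hi)  # second feature exceeds hi and is clipped
coded = complement_code(sample)              # doubles the feature dimension
```

Only the scaling step is the caller's responsibility; the complement-coded vector is shown to explain why values outside [0, 1] would be a problem.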
OverlapIndex supports both sklearn-style methods and direct score-returning update methods:

| Method | Returns | Typical use |
|---|---|---|
| fit(X, y) | self | Full offline fitting on a labeled dataset. |
| partial_fit(X, y) | self | Incremental batch updates for ARTMAP backends; offline backends refit on the provided batch. |
| score() / score(X, y) | float | Read the current index, or refit on labeled data and return the new score. |
| predict(X) | np.ndarray | Return the highest-scoring global prototype id for each sample. |
| fit_predict(X, y) | np.ndarray | Fit and return per-sample prototype ids. |
| add_batch(X, y) | float | Batch update when the current OI score is needed immediately. |
| add_sample(x, y) | float | Single-sample online update for ARTMAP backends. |

After fit or partial_fit, read the current score from oi.index or call score().
For model_type="KMeans", model_type="MiniBatchKMeans", and
model_type="BallCover", partial_fit(X, y) is a convenience wrapper around
recomputing the index on the provided labeled batch. Only the ARTMAP backends
perform true incremental updates across calls.
If a batch is empty or contains only one unique class, OverlapIndex emits a
RuntimeWarning and leaves the score at its default value of 1.0.
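For offline backends, where each call recomputes the index from the batch it is given, a caller may prefer to skip degenerate batches up front instead of handling the warning. The check below is a minimal sketch; `has_multiple_classes` is a hypothetical helper and the batches are toy data.

```python
def has_multiple_classes(labels):
    """Return True when a batch contains at least two distinct labels."""
    return len(set(labels)) >= 2


batches = [
    ([[0.1], [0.9]], [0, 1]),  # two classes: usable
    ([[0.2], [0.3]], [1, 1]),  # one class: would trigger the warning
    ([], []),                  # empty: would trigger the warning
]
# Keep only the batches that can produce a meaningful score.
usable = [(X, y) for X, y in batches if has_multiple_classes(y)]
```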
OverlapIndex uses model_type="MiniBatchKMeans" by default and supports several backend families through the model_type parameter:

| model_type | Update mode | Description |
|---|---|---|
| "Fuzzy" | Online / batch | Incremental Fuzzy ARTMAP backend. Requires the optional art extra. |
| "Hypersphere" | Online / batch | Incremental Hypersphere ARTMAP backend. Requires the optional art extra. |
| "KMeans" | Offline batch only | Fits one scikit-learn KMeans model per class. |
| "MiniBatchKMeans" | Offline batch only | Default backend. Fits one scikit-learn MiniBatchKMeans model per class; recommended for larger datasets. |
| "BallCover" | Offline batch only | Fits one greedy landmark-ball cover per class. Useful when preserving class-support geometry is important. |

Offline backends should be used with fit or add_batch. They do not support add_sample because their prototypes are fit from a complete labeled batch.
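When data arrives one sample at a time but an offline backend is in use, one option is to buffer samples and hand them over as a batch. The class below is a hypothetical sketch, not part of the package: `on_batch` is any callable taking `(X, y)`, such as the `add_batch` method of an offline-backend instance, and the demo uses a stand-in callback so the example runs on its own.

```python
class SampleBuffer:
    """Collect single labeled samples and hand them over as one batch.

    Hypothetical helper: `on_batch` is any callable taking (X, y).
    """

    def __init__(self, on_batch, batch_size):
        self.on_batch = on_batch
        self.batch_size = batch_size
        self.X, self.y = [], []

    def add(self, x, label):
        """Store one sample; flush automatically when the buffer is full."""
        self.X.append(x)
        self.y.append(label)
        if len(self.y) >= self.batch_size:
            return self.flush()
        return None

    def flush(self):
        """Pass the buffered samples to on_batch and clear the buffer."""
        if not self.y:
            return None
        result = self.on_batch(self.X, self.y)
        self.X, self.y = [], []
        return result


# Demo with a stand-in callback that just records batch sizes.
seen = []
buffer = SampleBuffer(lambda X, y: seen.append(len(y)), batch_size=2)
for x, label in [([0.1], 0), ([0.9], 1), ([0.5], 0)]:
    buffer.add(x, label)
buffer.flush()  # hand over the final partial batch
```

Since offline backends recompute from the batch they receive, each flush would yield a score for that window of samples only, so the batch size should be large enough to contain several classes.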
KMeans backend:

```python
from overlapindex import OverlapIndex

OI = OverlapIndex(
    model_type="KMeans",
    kmeans_k=10,
    kmeans_kwargs={"random_state": 0},
)
OI.fit(X, y)
score = OI.index
```

MiniBatchKMeans backend:

```python
from overlapindex import OverlapIndex

OI = OverlapIndex(
    model_type="MiniBatchKMeans",
    kmeans_k=10,
    kmeans_kwargs={
        "random_state": 0,
        "batch_size": 8192,
        "n_init": 1,
    },
)
OI.fit(X, y)
score = OI.index
```

BallCover backend:

```python
from overlapindex import OverlapIndex

OI = OverlapIndex(
    model_type="BallCover",
    ballcover_k="auto",
    ballcover_radius=0.25,
    ballcover_kwargs={
        "metric": "auto",
        "cover_fraction": 1.0,
    },
)
OI.fit(X, y)
score = OI.index
```

The BallCover backend supports one automatic cover parameter at a time:

- ballcover_k="auto" with a fixed ballcover_radius greedily adds balls until the requested cover fraction is reached.
- ballcover_k=<int> with ballcover_radius="auto" selects a fixed number of landmarks and infers the radius needed to cover the requested fraction of samples.

metric="auto" uses Euclidean distance in lower-dimensional spaces and cosine geometry for high-dimensional inputs such as embedding vectors. Users can override this with metric="euclidean" or metric="cosine".
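To make the fixed-radius mode concrete, here is a generic greedy cover in plain Python. It is a sketch of the general technique under simplifying assumptions (Euclidean distance, candidate centers restricted to the data points, quadratic cost) and is not the package's BallCover code; `greedy_ball_cover` is a hypothetical name.

```python
import math


def greedy_ball_cover(points, radius, cover_fraction=1.0):
    """Greedily pick ball centers until cover_fraction of points is covered.

    Illustrative sketch using Euclidean distance; not the package's code.
    Returns the indices of the chosen center points.
    """
    n = len(points)
    target = min(n, math.ceil(cover_fraction * n))
    # For each candidate center, the set of point indices inside its ball.
    balls = [
        {j for j, q in enumerate(points) if math.dist(p, q) <= radius}
        for p in points
    ]
    uncovered = set(range(n))
    centers = []
    while n - len(uncovered) < target:
        # Choose the ball that covers the most still-uncovered points.
        best = max(range(n), key=lambda i: len(balls[i] & uncovered))
        centers.append(best)
        uncovered -= balls[best]
    return centers


# Two tight groups of points, far apart relative to the radius.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0)]
centers = greedy_ball_cover(points, radius=0.25)
```

On this toy input a full cover needs one ball per group, while a cover fraction of 0.5 is already met by the single ball around the larger group.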
```python
from sklearn.datasets import load_iris
import numpy as np
from overlapindex import OverlapIndex

# Load dataset
iris = load_iris()
# Feature matrix (shape: [150, 4])
X = iris.data.astype(np.float64)
# Target vector (shape: [150,])
y = iris.target.astype(np.int64)

# Normalize the data (required)
x_max = X.max(axis=0)
x_min = X.min(axis=0)
X = (X - x_min) / (x_max - x_min)

# Instantiate the OI object
OI = OverlapIndex()
# Calculate the Overlap Index
OI.fit(X, y)
print(OI.index)
# Output:
# 0.9266666666666666
```

Additional runnable examples are available in the examples/ directory.

For release testing, start from a fresh Poetry environment so the package under test matches pyproject.toml and poetry.lock:

```bash
poetry env remove --all
poetry sync --with dev
poetry run python -c "from overlapindex import OverlapIndex; OverlapIndex(model_type='MiniBatchKMeans')"
poetry run python -m pytest -q tests/test_overlap_index_regression.py
poetry sync --with dev --extras art
poetry run python -c "from overlapindex import OverlapIndex; OverlapIndex(model_type='Hypersphere')"
poetry run python -m pytest -q tests/test_overlap_index_regression.py
poetry check
python -m build
twine check dist/*
```

The first install verifies that offline backends work without the optional artlib dependency. The second install verifies the art extra and ARTMAP backends.

The constructor accepts the following parameters:

- rho (float): Vigilance parameter controlling cluster granularity for ARTMAP backends.
- r_hat (float, Hypersphere ARTMAP only): Maximum cluster radius for the Hypersphere backend.
- model_type ("Fuzzy" | "Hypersphere" | "KMeans" | "MiniBatchKMeans" | "BallCover"): Clustering backend used to create class-owned prototypes. Defaults to "MiniBatchKMeans".
- match_tracking (str): Match-tracking strategy used during ARTMAP learning.
- kmeans_k (int or dict): Number of clusters per class for KMeans and MiniBatchKMeans backends.
- kmeans_kwargs (dict, optional): Keyword arguments forwarded to the selected scikit-learn KMeans backend.
- ballcover_k (int, dict, or "auto"): Number of balls per class, class-specific ball counts, or "auto" for greedy fixed-radius covering.
- ballcover_radius (float, dict, or "auto"): Ball radius, class-specific radii, or "auto" when using a fixed number of balls.
- ballcover_kwargs (dict, optional): Additional BallCover options such as metric, cover_fraction, chunk_size, max_balls, and random_state.
- exclude_classes (None, scalar label, or iterable of labels): Label ids to omit from the global index and weighted_index aggregation while leaving all fitting and per-class overlap outputs intact.

The default parameters are intended for offline batch use with MiniBatchKMeans. For online or continual-learning workflows, explicitly choose model_type="Fuzzy" or model_type="Hypersphere". For very large ART-based runs, smaller rho values (0.5-0.7) may improve run-time performance.
The computed scores are exposed through the following attributes:

- index: Global macro Overlap Index across all observed classes that are not listed in exclude_classes. This is the default class-balanced score and is usually preferred for imbalance-sensitive separation analysis.
- weighted_index: Support-weighted Overlap Index across observed classes that are not listed in exclude_classes. This weights each included class's singleton_index value by its positive sample count, which can be useful when reporting should reflect observed class frequencies.
- singleton_index[y]: Minimum pairwise overlap score for class y.
- pairwise_index[(y, b)]: Pairwise overlap score between classes y and b.

This package is intended for researchers and practitioners working on:
- incremental and continual learning,
- clustering validation,
- representation learning,
- transfer learning.
