Vectorium is a Rust library for storing, accessing, and compressing dense, sparse and multivector embedding datasets. The main goal is to provide a unified dataset/encoder interface that can be shared by indexing/search crates such as Seismic and kANNolo.
If you are new to KNN: exhaustive KNN searches score every vector in the dataset and return the top‑k closest results. That is accurate but slow at scale. ANN indexes (HNSW, IVF, Seismic, etc.) trade a bit of accuracy for speed by building extra data structures (e.g., proximity graphs, inverted indexes) on top of the same dataset/encoder primitives.
Vectorium includes an exhaustive search API (Dataset::search) and a binary executable for ground-truth computation on CPU. For state‑of‑the‑art ANN indexing, use these tools: Seismic and kANNolo.
Vectorium ships a binary compute_groundtruth (feature cli) under src/bin/ to exhaustive top‑k for a set of queries, writes a TSV file.
This repository currently targets nightly Rust (rust-toolchain.toml pins it). For performance‑oriented builds, prefer
--release and (optionally) native CPU tuning:
RUSTFLAGS="-C target-cpu=native" cargo build --release --features cliAfter building, you will find binary compute_groundtruth in target/release/.
You can also run via Cargo directly:
RUSTFLAGS="-C target-cpu=native" cargo run --release --features cli --bin compute_groundtruth -- --helpNote:
compute_groundtruthdepends on optional CLI dependencies and requires--features cli.
This tool computes exhaustive top‑k neighbors for each query and writes the results as a TSV file. It is designed for research/benchmarking workflows where you need exact results (ground truth) to compare against an ANN index. It parallelizes across queries using Rayon, which is typically the right granularity for CPU ground‑truth runs.
--dataset-type dense(default): expects.npyinputs (dataset + queries). The reader loads.npyasf32and can optionally convert the dataset storage via--value-type.--dataset-type sparse: expects Seismic binary format inputs (dataset + queries) viaread_seismic_format.
Each output line is:
query_id<TAB>doc_id<TAB>rank<TAB>score
query_id: 0‑based query index.doc_id: aVectorId(0‑based vector index in the dataset).rank:1..=k.score:- for
--distance euclidean: squared Euclidean distance (smaller is better); - for
--distance dotproduct: dot product score (larger is better).
- for
Dense vectors stored in .npy, Euclidean, PQ:
RUSTFLAGS="-C target-cpu=native" cargo build --release --features cli
./target/release/compute_groundtruth \
-i <path>/dataset.npy \
-q <path>/queries.npy \
-o sift_gt_l2.tsv \
--dataset-type dense \
--value-type f32 \
--distance euclidean \
--encoder pq \
--pq-subspaces 8Dense vectors stored in .npy, dot product, plain (no PQ):
RUSTFLAGS="-C target-cpu=native" cargo build --release --features cli
./target/release/compute_groundtruth \
-i <path>/dataset.npy \
-q <path>/queries.npy \
-o sift_true_gt_ip.tsv \
--dataset-type dense \
--value-type f32 \
--distance dotproduct \
--encoder plainSparse dataset (Seismic binary), dot product, DotVByte compression:
RUSTFLAGS="-C target-cpu=native" cargo build --release --features cli
./target/release/compute_groundtruth \
-i <path>/dataset.seismic \
-q <path>/queries.seismic \
-o sparse_gt_ip.tsv \
--dataset-type sparse \
--component-type u16 \
--value-type f32 \
--distance dotproduct \
--encoder dotvbyte-i, --input-file <PATH>: dataset path (.npywhen--dataset-type dense, Seismic binary whensparse).-q, --query-file <PATH>: query path (same format as the dataset type).-o, --output-path <PATH>: output TSV path.-k, --k <N>: number of neighbors to output per query (default: 10).-d, --distance <euclidean|dotproduct>: scoring metric (default:euclidean).-v, --value-type <f32|f16|bf16|fixedu8|fixedu16>: dataset storage value type (default:f32).- Dense queries are always
f32. - Sparse queries are always
f32(dataset values may be quantized).
- Dense queries are always
--dataset-type <dense|sparse>: dataset format (default:dense).--component-type <u16|u32>: sparse component type (default:u32).--encoder <plain|pq|dotvbyte>: encoder (default:plain).pq: dense‑only, requires--value-type f32, and uses--pq-subspaces.dotvbyte: sparse‑only, requires--component-type u16and--distance dotproduct.
--pq-subspaces <M>: number of PQ subspaces (0auto‑selects a supported value that divides the dataset dimension).- Product Quantization splits vectors into
Mequal‑sized chunks, sodim % M == 0is required. - Supported values are currently
{128, 96, 64, 32, 16, 8, 4}.
- Product Quantization splits vectors into
The library is organized around two core concepts:
- A
Datasetowns encoded vectors and an encoder. - A
VectorEncoderdefines the encoded representation and how to evaluate distances via aQueryEvaluator.
Indexes typically store a dataset plus extra search structures. During search they repeatedly need distances between a query and candidate vectors inside the dataset. Vectorium is designed to make that path explicit and efficient: you can build a query evaluator once, then score many candidate vectors without decoding them first.
use vectorium::{Dataset, DenseDataset, DenseVectorView, DotProduct, DatasetGrowable, PlainDenseDatasetGrowable, PlainDenseQuantizer};
let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut dataset = PlainDenseDatasetGrowable::new(encoder);
dataset.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
dataset.push(DenseVectorView::new(&[0.5, 1.5, 0.0]));
let frozen: DenseDataset<_> = dataset.into();
assert_eq!(frozen.len(), 2);
assert_eq!(frozen.get(0).values(), &[1.0, 0.0, 2.0]);use vectorium::{Dataset, DotProduct, DatasetGrowable, PlainSparseDataset, PlainSparseDatasetGrowable, PlainSparseQuantizer, SparseVectorView};
let encoder = PlainSparseQuantizer::<u16, f32, DotProduct>::new(5, 5);
let mut dataset = PlainSparseDatasetGrowable::new(encoder);
dataset.push(SparseVectorView::new(&[1_u16, 3], &[1.0, 2.0]));
dataset.push(SparseVectorView::new(&[0_u16, 4], &[0.5, 3.5]));
let frozen: PlainSparseDataset<u16, f32, DotProduct> = dataset.into();
assert_eq!(frozen.len(), 2);
assert_eq!(frozen.get(0).components(), &[1_u16, 3]);The typical pattern for an index is: build the evaluator once, then score many candidates.
use vectorium::{Dataset, DatasetGrowable, DenseDataset, DenseVectorView, DotProduct, PlainDenseDatasetGrowable, PlainDenseQuantizer, QueryEvaluator, VectorEncoder, VectorId};
let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut growable = PlainDenseDatasetGrowable::new(encoder);
growable.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
growable.push(DenseVectorView::new(&[0.5, 1.5, 0.0]));
let dataset: DenseDataset<_> = growable.into();
let query = DenseVectorView::new(&[0.2, 0.1, 0.7]);
let evaluator = dataset.encoder().query_evaluator(query);
let candidate: VectorId = 0;
let score = evaluator.compute_distance(dataset.get(candidate));
let _ = score;If you do not have an ANN index yet, Dataset::search(query, k) provides an exhaustive top‑k baseline:
use vectorium::{Dataset, DatasetGrowable, DenseDataset, DenseVectorView, DotProduct, PlainDenseDatasetGrowable, PlainDenseQuantizer};
let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut growable = PlainDenseDatasetGrowable::new(encoder);
growable.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
growable.push(DenseVectorView::new(&[0.5, 1.5, 0.0]));
let dataset: DenseDataset<_> = growable.into();
let query = DenseVectorView::new(&[0.2, 0.1, 0.7]);
let top2 = dataset.search(query, 2);
assert_eq!(top2.len(), 2);The easiest way to use multiple CPU cores is to parallelize across queries. This is the strategy used by
compute_groundtruth.
use rayon::prelude::*;
use vectorium::{Dataset, DatasetGrowable, DenseVectorView, DotProduct, PlainDenseDatasetGrowable, PlainDenseQuantizer};
let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut growable = PlainDenseDatasetGrowable::new(encoder);
growable.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
growable.push(DenseVectorView::new(&[0.5, 1.5, 0.0]));
let queries = vec![vec![0.2_f32, 0.1, 0.7], vec![1.0_f32, 0.0, 0.0]];
let results: Vec<_> = queries
.par_iter()
.map(|q| growable.search(DenseVectorView::new(q.as_slice()), 1))
.collect();
assert_eq!(results.len(), 2);Datasets expose range_from_id/id_from_range so callers can keep lightweight handles to the underlying storage ranges.
This is mainly useful for sparse/packed layouts, where range lookups can be a cache miss and prefetching can help.
use vectorium::{Dataset, DatasetGrowable, DenseDataset, DenseVectorView, DotProduct, PlainDenseDatasetGrowable, PlainDenseQuantizer};
let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut growable = PlainDenseDatasetGrowable::new(encoder);
growable.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
let dataset: DenseDataset<_> = growable.into();
let id = 0;
let range = dataset.range_from_id(id);
dataset.prefetch_with_range(range.clone());
let view = dataset.get_with_range(range);
assert_eq!(view.values(), &[1.0, 0.0, 2.0]);- Type safety by construction: datasets are sealed and are tied to a specific encoder type, so mixing dense/sparse/packed representations is prevented at compile time.
- Evaluator-driven search: encoders build a
QueryEvaluatorfrom the query once; the evaluator can then score many dataset vectors. This matches how most ANN indexes are structured internally. - Distance ordering: distances implement
Ord.DotProductuses reversed ordering (larger is better), whileSquaredEuclideanDistanceuses the natural ordering (smaller is better). Distance values must not be NaN. - Range-based access:
range_from_id/get_with_rangeandprefetch_with_rangeare meant for index implementations that keep storage ranges around (especially for sparse/packed datasets) and want predictable, low-overhead access.
MIT. See LICENSE-MIT.
