- Introduction
- Compatibility
- Install
- Getting Started
- Troubleshooting and FAQs
- Other Information
- Legal and Compliance Information
- Updates and Release Notes
A very Canadian utility for genomic clustering and distance querying.
This program is under active development. It is used for creating distance matrices from allelic profiles, or for comparing a group of query isolates against a group of reference isolates. This program is similar to cgmlst-dists from Torsten Seemann, and uses test data from his original program.
[Matthew Wells] : matthew.wells@phac-aspc.gc.ca
beave has only been tested on Linux, but any system that supports G++ should be able to compile the program. As only the C++23 standard library is used, it may also be possible to compile the program on Windows systems.
This program relies heavily on the compiler to optimize the code and add SIMD instructions, so it is recommended to compile the program on your local computer to get the full benefit of the instruction sets your CPU offers. Compilation with AVX-512 instruction sets has been tested; however, in our testing the program's performance degrades, likely due to throttling by the CPU.
To build the beave Python package, you will first need to install dependencies listed in the pyproject.toml file. Python version 3.13 or greater is required, along with scikit-build-core and the nanobind Python package.
Python runtime dependencies include numpy >= 2.4.0, polars >= 1.40.1, and scipy >= 1.17.0.
Start by pulling the repository.
git clone https://github.com/phac-nml/beave
git submodule update --init --recursive
To build and install the Python package you must have the scikit-build-core and nanobind Python packages, which can be installed with pip install nanobind scikit-build-core[pyproject].
Developers can run pip install --no-build-isolation -ve .[dev] or pip install --no-build-isolation -Ceditable.rebuild=true -ve .[dev]. Further examples can be found in the nanobind documentation here: nanobind packaging.
To build a wheel that can be distributed instead of installed, simply run pip wheel .
- Pull the GitHub repository as described above.
- Create the Conda environment by running:
  conda env create -f environment.yml
- Activate the environment with:
  conda activate beave
- Install the package with:
  pip install .
  To install for development, use:
  pip install --no-build-isolation -ve .[dev]
- Python tests can then be run with pytest.
This program is written entirely in C++23. The only build dependencies are a G++ compiler and CMake. Catch2 is required only for testing and is managed by CMake, which means an internet connection is required when first building the program.
To build the program, pull the latest branch and follow the instructions below:
cd ./beave
mkdir build && cd build
cmake .. -DTARGET_GROUP=release
make -j4
This will compile a release build. The resulting binary will be available in the release directory that CMake creates inside the build directory. The binary can then be copied into a bin directory on your system PATH.
To run tests, follow the build instructions below (it is presumed you are in the build directory created in the previous step already):
cmake .. -DTARGET_GROUP=test
make -j4
make test
Unit tests are run by Catch2. However, all tests are orchestrated by CTest, which is typically bundled with CMake.
To create a debug build, follow the next steps (it is presumed you are still in the build directory):
cmake .. # The debug build is the default
make -j4
The output binary will be in the debug directory.
The main help message for the program is shown below:
>>> beave -h
usage: beave [-h] [--n-threads N_THREADS] [--delimiter DELIMITER] [--columns COLUMNS] [--count-missing] [--scaled]
[--filter-threshold FILTER_THRESHOLD] [--verbose] [--version]
{cluster,match} ...
A quick proof of concept of generic utilities for nomenclature assignment.
positional arguments:
{cluster,match} Select a program to run.
cluster Run denovo clustering.
match Run fast matching.
options:
-h, --help show this help message and exit
--n-threads, -n N_THREADS
Specify the number of threads to be used. [default 12]
--delimiter, -d DELIMITER
Input alleles delimiter. [default \t] (default: )
--columns, -k COLUMNS
A file containing a single column of the column names to subset from the passed allele
profiles. (default: None)
--count-missing, -c Count missing values in allele profiles differences. (default: False)
--scaled, -s Compute the scaled distance. Distance is presented as a percentage, or a value between
0.0-100.0 (default: False)
--filter-threshold, -f FILTER_THRESHOLD
Excluded samples from analysis if it is missing more than the specified percentage of data.
Must be between 0.0 and 100.0. [default 100.0]
--verbose Display logger debug messages. (default: False)
--version, -v show program's version number and exit
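As a rough illustration of what --filter-threshold describes, the Python sketch below drops samples whose share of missing allele calls exceeds a percentage. It is not beave's implementation: the function names are hypothetical, the set of missing markers is taken from the input description in this README, and the strict "more than" comparison is an assumption based on the help text.

```python
# Hypothetical illustration of --filter-threshold; not beave's actual code.
MISSING = {"?", " ", "", "-", "_", "0"}  # markers listed in this README


def percent_missing(alleles):
    """Return the percentage of allele calls in a profile that are missing."""
    if not alleles:
        return 0.0
    n_missing = sum(1 for allele in alleles if allele in MISSING)
    return 100.0 * n_missing / len(alleles)


def filter_profiles(profiles, threshold=100.0):
    """Keep samples that are not missing more than `threshold` percent of data."""
    if not 0.0 <= threshold <= 100.0:
        raise ValueError("threshold must be between 0.0 and 100.0")
    return {
        sample: alleles
        for sample, alleles in profiles.items()
        if percent_missing(alleles) <= threshold
    }


profiles = {
    "A": ["1", "2", "3", "4"],  # 0% missing
    "B": ["1", "?", "0", "4"],  # 50% missing
    "C": ["-", "?", "0", "4"],  # 75% missing
}
print(sorted(filter_profiles(profiles, threshold=50.0)))  # ['A', 'B']
```

With the default of 100.0 no sample can exceed the threshold, so nothing is excluded.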
To run de-novo clustering use the cluster option. The long form options for cluster are shown below:
>>> beave cluster --help
usage: beave cluster [-h] [--n-threads N_THREADS] [--delimiter DELIMITER] [--columns COLUMNS] [--count-missing]
[--scaled] [--filter-threshold FILTER_THRESHOLD] [--verbose] --input INPUT
[--tree-output TREE_OUTPUT] [--cluster-output CLUSTER_OUTPUT]
--thresholds THRESHOLDS [THRESHOLDS ...] [--method {single,average,complete}]
[--tree-distances {patristic,cophenetic}]
options:
-h, --help show this help message and exit
--n-threads, -n N_THREADS
Specify the number of threads to be used. [default 12]
--delimiter, -d DELIMITER
Input alleles delimiter. [default \t]
--columns, -k COLUMNS
A file containing a single column of the column names to subset from the passed allele
profiles.
--count-missing, -c Count missing values as differences.
--scaled, -s Compute the scaled distance. Distance is presented as a percentage, or a value between
[0.0-100.0]
--filter-threshold, -f FILTER_THRESHOLD
Excluded samples from analysis if it is missing more than the specified percentage of data.
Must be between [0.0-100.0]. [default 100.0]
--verbose Display logger debug messages.
--input, -i INPUT Input alleles.
--tree-output, -t TREE_OUTPUT
File path to write generated tree. [default clusters.nwk]
--cluster-output, -l CLUSTER_OUTPUT
File path to write generated clusters. [default clusters.tsv]
--thresholds, -p THRESHOLDS [THRESHOLDS ...]
List of threshold values to use.
--method, -m {single,average,complete}
Hierarchical clustering linkage to use. [default: average]
--tree-distances, -b {patristic,cophenetic}
Determine how to display tree lenghts in the newick file. [default cophenetic]
>>> # Example programs
>>> beave cluster --input src/beave/tests/data/R1KC1K.2-zeroes.does-not-exist.csv -t tree.out -m average -l clusters.tsv -sc -b cophenetic -n 2 -p 1 0.5 -d ,
>>> beave cluster --input src/beave/tests/data/R1KC1K.tsv -t tree.out -m average -l clusters.tsv -b cophenetic -n 0 --thresholds 10 9 8
The match argument may be used to compute pairwise distances between a group of query samples and a group of reference samples. The parameters for running match are described below:
>>> beave match --help
usage: beave match [-h] [--n-threads N_THREADS] [--delimiter DELIMITER] [--columns COLUMNS] [--count-missing]
[--scaled] [--filter-threshold FILTER_THRESHOLD] [--verbose] --reference REFERENCE --query QUERY
[--threshold THRESHOLD] [--output OUTPUT]
options:
-h, --help show this help message and exit
--n-threads, -n N_THREADS
Specify the number of threads to be used. [default 12]
--delimiter, -d DELIMITER
Input alleles delimiter. [default \t]
--columns, -k COLUMNS
A file containing a single column of the column names to subset from the passed allele
profiles.
--count-missing, -c Count missing values as differences.
--scaled, -s Compute the scaled distance. Distance is presented as a percentage, or a value between
[0.0-100.0]
--filter-threshold, -f FILTER_THRESHOLD
Excluded samples from analysis if it is missing more than the specified percentage of data.
Must be between [0.0-100.0]. [default 100.0]
--verbose Display logger debug messages.
--reference, -r REFERENCE
Profiles to compare against.
--query, -q QUERY Profiles containing new-samples for comparisons.
--threshold, -t THRESHOLD
Only report distances below specified threshold. [default: inf]
--output, -o OUTPUT Fast match result output tsv file. [default: output.tsv]
>>> # Example programs
>>> beave match -q src/beave/tests/data/R1KC1K.head.tsv -r src/beave/tests/data/R1KC1K.tail.tsv -sc --verbose
>>> beave match -q src/beave/tests/data/R1KC1K.head.tsv -r src/beave/tests/data/R1KC1K.tail.tsv -t 101 -o output.tsv -n 1
The inputs for this program must be tabular. Any delimiter is supported as long as it is a single character. The first column of the file must contain no duplicates or missing values. Column names are not checked for uniqueness, so duplicate column names will be name-mangled and treated as separate columns. The characters "?", " ", "", "-", "_", and "0" are treated as missing values by the program unless the -c option is passed. All other values are treated as valid alleles. Example inputs can be found in the tests folder. Thresholds are always converted to float values, so you can specify either integers or decimals.
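To make the handling of missing values concrete, here is a minimal Python sketch of a Hamming distance between two allele profiles. It is an illustration under stated assumptions rather than beave's code: the function name is hypothetical, loci with a missing call are assumed to be skipped by default, and with count_missing enabled the missing markers are assumed to be compared like any other value.

```python
# Hypothetical sketch of a missing-aware Hamming distance; not beave's code.
MISSING = {"?", " ", "", "-", "_", "0"}  # markers listed in this README


def hamming(a, b, count_missing=False):
    """Count loci at which two equally long allele profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of loci")
    differences = 0
    for x, y in zip(a, b):
        if not count_missing and (x in MISSING or y in MISSING):
            continue  # assumed default: loci with a missing call are ignored
        if x != y:
            differences += 1
    return differences


query = ["1", "2", "0", "4"]
reference = ["1", "3", "5", "?"]
print(hamming(query, reference))                      # 1
print(hamming(query, reference, count_missing=True))  # 3
```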
When running match, the query and reference profiles will be merged by the program. If duplicate IDs are detected, an error will be raised.
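The merge step can be pictured with the short Python sketch below. The function name and the exception type are hypothetical, and the real program may report duplicates differently; the sketch only shows the idea of refusing to merge when a sample ID appears in both inputs.

```python
# Hypothetical sketch of merging query and reference profiles; not beave's code.
def merge_profiles(query, reference):
    """Merge two {sample_id: alleles} mappings, rejecting shared sample IDs."""
    duplicates = sorted(set(query) & set(reference))
    if duplicates:
        raise ValueError("duplicate sample IDs: " + ", ".join(duplicates))
    merged = dict(query)
    merged.update(reference)
    return merged


merged = merge_profiles({"q1": ["1", "2"]}, {"r1": ["1", "3"]})
print(sorted(merged))  # ['q1', 'r1']
```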
The program outputs a Newick-format file containing the tree generated by whichever linkage method is selected. The sample IDs and their addresses are written to a separate user-specified file in TSV format. Addresses are delimited by a '.'.
Example of cluster outputs:
| SampleID | level_10.0 | level_9.0 | denovo_address |
|---|---|---|---|
| CoolSample | 1 | 2 | 1.2 |
| CoolSample2 | 2 | 1 | 2.1 |
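The sketch below shows one way an address such as 1.2 could be assembled from per-threshold cluster labels. It is not beave's algorithm: single linkage is used only because its flat clusters at a threshold are the connected components of the graph linking samples whose distance is within that threshold, which keeps the example short. The function names, the "<=" comparison, and the label numbering are all assumptions.

```python
# Hypothetical sketch of building '.'-delimited addresses; not beave's code.
from itertools import combinations


def single_linkage_labels(ids, dist, threshold):
    """Label samples by connected components of the graph dist <= threshold."""
    parent = {sample: sample for sample in ids}

    def find(sample):
        while parent[sample] != sample:
            parent[sample] = parent[parent[sample]]  # path halving
            sample = parent[sample]
        return sample

    for a, b in combinations(ids, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    # Number clusters from 1 in order of first appearance (an assumption).
    root_labels = {}
    labels = {}
    for sample in ids:
        root = find(sample)
        if root not in root_labels:
            root_labels[root] = len(root_labels) + 1
        labels[sample] = root_labels[root]
    return labels


def addresses(ids, dist, thresholds):
    """Join the cluster label at each threshold into one address per sample."""
    levels = [single_linkage_labels(ids, dist, t) for t in thresholds]
    return {s: ".".join(str(level[s]) for level in levels) for s in ids}


ids = ["A", "B", "C"]
dist = {
    frozenset(("A", "B")): 3,
    frozenset(("A", "C")): 12,
    frozenset(("B", "C")): 11,
}
print(addresses(ids, dist, thresholds=[10, 5, 2]))
# {'A': '1.1.1', 'B': '1.1.2', 'C': '2.2.3'}
```

Samples that share a prefix of their address fall in the same cluster at the corresponding thresholds, which is what makes the address useful as a nomenclature code.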
The output of match is a single file showing the query sample, the reference sample and distance.
The general structure of the match output:
| query_id | ref_id | dist_{hamming,scaled} |
|---|---|---|
| 1 | 2 | 4 |
| 1 | 3 | 8 |
| 1 | 4 | 10 |
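A match result in this shape is easy to post-process. The Python sketch below picks the closest reference for each query from the example rows above. It assumes the output file is tab-separated with a header row and that the distance column is named dist_hamming (or dist_scaled when --scaled is used); the function name is hypothetical.

```python
# Hypothetical post-processing of a match output table; column names assumed.
import csv
import io

example = "query_id\tref_id\tdist_hamming\n1\t2\t4\n1\t3\t8\n1\t4\t10\n"


def closest_matches(handle, dist_column="dist_hamming"):
    """Return {query_id: (ref_id, distance)} for the nearest reference."""
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        distance = float(row[dist_column])
        query = row["query_id"]
        if query not in best or distance < best[query][1]:
            best[query] = (row["ref_id"], distance)
    return best


print(closest_matches(io.StringIO(example)))  # {'1': ('2', 4.0)}
```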
The help message for the program can be brought up by running it with no arguments, as shown below. The -h or --help flag can also be used to print the help message at any time:
$ beave
No args passed
Subcommands:
matrix - Create distance matrix with an input profile.
fast-match - Compare a set of profiles to a set of query profiles.
Examples:
beave matrix -i profiles.tsv -t 4 -sc > output.tsv
beave fast-match -i qprofiles.tsv -r profiles.tsv -t 4 -sc > output.tsv
The program will display the top-level help messages with the example commands, and warn the user that no arguments have been passed.
Note - The program supports multi-threading but is intended for use on a single CPU; therefore, the maximum number of CPU cores that can currently be specified is 256.
The help message for each command is displayed below along with its default settings. Flag options can be combined into a single string, e.g. "-sc", or given separately, e.g. "-s -c". When passing characters such as delimiters, you may have to use ANSI C quoted strings in bash (e.g. $'\t' to specify a tab delimiter). The default values are specified in the help message.
$ beave matrix
Subcommands:
matrix - Create distance matrix with an input profile.
fast-match - Compare a set of profiles to a set of query profiles.
Examples:
beave matrix -i profiles.tsv -t 4 -sc > output.tsv
beave fast-match -i qprofiles.tsv -r profiles.tsv -t 4 -sc > output.tsv
Command Options
--input| -i:
Input file file of profiles. [required]
--threads| -t:
How many threads to run. default = 1 [optional]
--missing| -m:
Specify the charactar to use for missing values. default = 0 [optional]
--delimiter| -d:
Delimiter for table. default = \t [optional]
--scaled| -s:
Calculate a scaled distance metric. [flag]
--count-missing| -c:
Include missing values in count of differences. [flag]
The help message for fast-matching is shown below. The only difference from generating a distance matrix is the addition of the -r flag, which specifies a set of reference profiles.
Subcommands:
matrix - Create distance matrix with an input profile.
fast-match - Compare a set of profiles to a set of query profiles.
Examples:
beave matrix -i profiles.tsv -t 4 -sc > output.tsv
beave fast-match -i qprofiles.tsv -r profiles.tsv -t 4 -sc > output.tsv
Command Options
--input| -i:
Input file file of profiles. [required]
--reference| -r:
Reference profiles to use for fast matching. [required]
--threads| -t:
How many threads to run. default = 1 [optional]
--missing| -m:
Specify the charactar to use for missing values. default = 0 [optional]
--delimiter| -d:
Delimiter for table. default = \t [optional]
--scaled| -s:
Calculate a scaled distance metric. [flag]
--count-missing| -c:
Include missing values in count of differences. [flag]
A detailed description of the flags for each program is provided below.
- -t|--threads: Specify the number of threads used by the program. If you specify more threads than there are profiles, only 1 thread will be used.
- -m|--missing: Specify the character that marks an allele value as uncalled. The default value is '0', but '?', '-', etc. can be passed.
- -d|--delimiter: Specify the delimiter used by the input file. A tab character is the default, but any single character can be used; for example, specify ',' to use a comma.
- -s|--scaled: Use the scaled distance instead of the Hamming distance. This metric reports the Hamming distance normalized by the number of comparisons made.
- -c|--count-missing: Treat missing allele calls as values. By default, comparisons to missing allele calls are not made; enabling this option to count missing values as differences will greatly increase the speed of the program.
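The scaled metric can be sketched in Python as the number of differing loci divided by the number of loci actually compared. This is an illustration, not the program's code: the function name is hypothetical, reporting the result as a percentage follows the Python help text earlier in this README and may not apply to this command, and returning 0.0 when nothing can be compared is an assumption.

```python
# Hypothetical sketch of the scaled distance; not beave's actual code.
def scaled_distance(a, b, missing="0"):
    """Hamming distance normalized by the number of comparisons, as a percent."""
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == missing or y == missing:
            continue  # loci with a missing call are not compared
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        return 0.0  # assumption: no comparable loci means no measurable distance
    return 100.0 * differences / compared


a = ["1", "2", "3", "4", "0"]
b = ["1", "9", "3", "8", "5"]
print(scaled_distance(a, b))  # 50.0
```

Here four loci are compared and two differ, so the last locus, which is missing in the first profile, affects neither the numerator nor the denominator.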
Input data needs to be a flat tabular file. The leftmost column is treated as the sample column and all subsequent columns are the alleles. The header line is mostly ignored but is required: if the first line of your table contains sample information, it will still be treated as the file header.
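The expected layout can be illustrated with a small Python reader. It is a sketch under stated assumptions, not the program's parser: the function name is hypothetical, and rows are returned as a list so that repeated sample IDs are preserved rather than overwritten.

```python
# Hypothetical reader for the tabular input layout; not beave's parser.
import csv
import io


def read_profiles(handle, delimiter="\t"):
    """Return (locus_names, [(sample_id, alleles), ...]) from a profile table."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        raise ValueError("a header line is required")
    loci = header[1:]  # the first header cell labels the sample column
    rows = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        rows.append((row[0], row[1:]))
    return loci, rows


text = "id\tl1\tl2\nS1\t1\t2\nS2\t1\t0\n"
loci, rows = read_profiles(io.StringIO(text))
print(loci)  # ['l1', 'l2']
print(rows)  # [('S1', ['1', '2']), ('S2', ['1', '0'])]
```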
Example inputs can be found in the data directory of the repository.
- There may be issues with the Python build system. Please create a GitHub issue for any issues identified.
- If there are duplicate identifiers with different profiles, the distance between the two will still be reported, as no duplicate detection is currently performed. Future iterations may add this functionality.
- -ffast-math is enabled during compilation to prevent sub-normals. This leads to some error in floating-point operations. The effect of this is being evaluated, and this compiler flag may be removed after further testing is performed.
Copyright Government of Canada [2026]
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Please see the CHANGELOG.md.