distance

A command-line program to calculate pairwise genetic distances within or between alignments of DNA sequences in fasta format.

Installation

You will need to install Rust first.

Then you should be able to install distance by running:

cargo install --git https://github.com/benjamincjackson/distance

and the binary will be installed in Cargo's bin directory (by default ~/.cargo/bin), which should be on your $PATH:

distance -h

or clone the repository and build it:

git clone https://github.com/benjamincjackson/distance.git
cd distance
cargo build --release

and the binary will be built inside the repository's target/release directory:

./target/release/distance --version

Usage

You can calculate all pairwise distances within a single alignment by providing a single input file, like:

distance alignment.fasta > distances.tsv

or, equivalently:

distance -i alignment.fasta -o distances.tsv

or, equivalently, but reading from stdin and printing to stdout:

cat alignment.fasta | distance

Or you can calculate all pairwise comparisons between two alignments like:

distance alignment1.fasta alignment2.fasta > distances2.tsv

or, equivalently:

distance -i alignment1.fasta alignment2.fasta -o distances2.tsv

Alignments read from stdin, passed as positional arguments, or passed to -i (which is equivalent) are loaded into memory. To stream an alignment from disk instead, use the -s option described below.

You can also calculate all pairwise distances between one alignment in memory and one alignment streamed from disk, using the -s / --stream flag, like:

distance -i smallAlignment.fasta -s bigAlignment.fasta -o distances3.tsv

You can also make -s read from stdin by passing it -, like:

cat bigAlignment.fasta | distance smallAlignment.fasta -s - > distances3.tsv

The output is a tab-separated-value file with three columns:

> head distances.tsv
sequence1	sequence2	distance
seq1	seq2	0.000719991771
seq1	seq3	0.001887958258
seq1	seq4	0.001949663239
seq1	seq5	0.002853704658
seq1	seq6	0.002861871595
seq1	seq7	0.001748731312
seq1	seq8	0.002784940691
seq1	seq9	0.000788562416
seq1	seq10	0.001381740301
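Because the output is plain tab-separated text with a header row, it is easy to load into other tools. The following is a minimal Python sketch, not part of distance itself, that parses rows in the format shown above and finds the closest pair. The function name read_distances is hypothetical, and the sample rows are copied from the example output; in practice you would pass an open file handle for distances.tsv instead.

```python
import csv
import io

# A few rows copied from the sample output above, used here instead of a real file.
sample = (
    "sequence1\tsequence2\tdistance\n"
    "seq1\tseq2\t0.000719991771\n"
    "seq1\tseq3\t0.001887958258\n"
    "seq1\tseq4\t0.001949663239\n"
)


def read_distances(handle):
    """Return a list of (sequence1, sequence2, distance) tuples from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["sequence1"], row["sequence2"], float(row["distance"]))
        for row in reader
    ]


rows = read_distances(io.StringIO(sample))

# Find the pair with the smallest distance.
closest = min(rows, key=lambda r: r[2])
print(closest)
```

To use it on real output, replace io.StringIO(sample) with open("distances.tsv", newline="").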

Different distance measures are available through the -m / --measure option (default: raw):

- n: the total number of nucleotide differences
- n_high: should give exactly the same answer as n, but will be faster in some circumstances (higher-diversity datasets, smaller datasets, or comparing a small number of sequences in -i with a large number of sequences in -s)
- raw: the number of nucleotide differences per site
- jc69: Jukes and Cantor's (1969) evolutionary distance
- k80: Kimura's (1980) evolutionary distance
- tn93: Tamura and Nei's (1993) evolutionary distance
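To make the relationship between the simpler measures above concrete, here is a rough Python sketch of n, raw, jc69 and k80 for a single pair of aligned sequences, using the standard textbook formulas. It is an illustration only and not how distance is implemented: the function names are hypothetical, and the choice to skip any site where either base is not A, C, G or T is an assumption, so results may differ from the program's on sequences containing gaps or ambiguity codes. tn93 is omitted for brevity.

```python
import math

# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_differences(a, b):
    """Return (sites compared, transitions, transversions) for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    sites = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        # Assumption: ignore sites where either base is not A, C, G or T.
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        if (x, y) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return sites, ts, tv


def pair_distances(a, b):
    """Return a dict of n, raw, jc69 and k80 distances for one pair."""
    sites, ts, tv = count_differences(a, b)
    if sites == 0:
        raise ValueError("no comparable sites")
    n = ts + tv
    p = n / sites                      # proportion of differing sites
    P, Q = ts / sites, tv / sites      # transition and transversion proportions
    return {
        "n": n,
        "raw": p,
        "jc69": -0.75 * math.log(1 - 4 * p / 3),
        "k80": -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q)),
    }


# One transition (A->G at the first site) and one transversion (C->A at the last).
d = pair_distances("ACGTACGTAC", "GCGTACGTAA")
print(d)
```

The corrected distances (jc69, k80) are larger than raw because they account for multiple substitutions at the same site; both are undefined when the sequences are too divergent for the logarithm's argument to stay positive.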

By default, distance spins up one thread for pairwise comparisons per logical CPU it detects, in addition to one thread for I/O.

If you prefer, you can use the -t option to specify how many threads to spin up for pairwise comparisons, e.g.:

distance -t 8 -m jc69 -i alignment.fasta -o jc69.tsv

You can also use the -b option (default: 1) to tune the workload per thread, which may improve efficiency:

distance -t 8 -b 1000 -m jc69 -i alignment.fasta -o jc69.tsv
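As a rough intuition for why a batch size can matter, the Python sketch below groups pairwise comparisons into batches before handing them to a pool of worker threads, so that each hand-off covers more work. This is a generic illustration of batching and an assumption about what "workload per thread" means; it is not taken from distance's source code, and the names batched, raw_distance and all_pairs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations, islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def raw_distance(a, b):
    """Proportion of sites that differ between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def all_pairs(seqs, threads=2, batchsize=2):
    """Compute raw distances for every pair, `batchsize` pairs per task."""
    pairs = combinations(seqs.items(), 2)

    def work(batch):
        # Each task handles a whole batch, amortising the per-task overhead.
        return [(n1, n2, raw_distance(s1, s2)) for (n1, s1), (n2, s2) in batch]

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the order the batches were submitted.
        results = pool.map(work, batched(pairs, batchsize))
        return [row for rows in results for row in rows]


seqs = {"seq1": "ACGT", "seq2": "ACGA", "seq3": "TCGA"}
out = all_pairs(seqs, threads=2, batchsize=2)
print(out)
```

With a batch size of 1, every comparison is its own task; larger batches mean fewer, bigger tasks, which is the kind of trade-off a batch-size setting lets you tune.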

Help

> distance -h
Calculate genetic distances within/between fasta-format alignments of DNA sequences

Usage: All sequences across all input files must be the same length.

       distance alignment.fasta
       cat alignment.fasta | distance
       distance alignment.fasta -o distances.tsv
       distance -t 8 -m jc69 alignment.fasta -o jc69.tsv
       distance alignment1.fasta alignment2.fasta > distances2.tsv
       distance -i smallAlignment.fasta -s bigAlignment.fasta -o distances3.tsv
       cat bigAlignment.fasta | distance smallAlignment.fasta -s - > distances3.tsv

Options:
  -i, --input [<input>...]     One or two input alignment files in fasta format. Loaded into memory. This flag can be omitted and the files passed as positional arguments
  -s, --stream <stream>        One input alignment file in fasta format. Streamed from disk (or stdin using "-s -"). Requires exactly one file also be loaded
  -m, --measure <measure>      Which distance measure to use [default: raw] [possible values: n, n_high, raw, jc69, k80, tn93]
  -o, --output <output>        Output file in tab-separated-value format. Omit this option to print to stdout
  -t, --threads <threads>      How many threads to spin up for pairwise comparisons. Omitting this option spins up the number of available CPUs
  -b, --batchsize <batchsize>  Try setting this >(>) 1 to tune the workload per thread [default: 1]
  -l, --licenses               Print licence information and exit
  -h, --help                   Print help
  -V, --version                Print version

Acknowledgements

This program incorporates rust-bio. This program makes use of the bitwise coding scheme for nucleotides by Emmanuel Paradis, as used in ape (Paradis, 2004). Equation (7) in Tamura and Nei (1993) is also rearranged according to ape's source code.
