A Go toolkit for computational genomics research. Provides library packages for sequence I/O, alignment, and NGS data processing, along with CLI commands for common bioinformatics operations. Particular focus on Oxford Nanopore (long-read) sequencing workflows.
Module: github.com/compgen-io/cgkit
make build # Build all targets (darwin_arm64, linux_arm64, linux_amd64)
make test # Run all tests

Streaming readers and writers for FASTA and FASTQ files with transparent gzip support.
- `SeqReader` / `SeqRecord`: interfaces for uniform access across formats
- `FastaReader` / `FastqReader`: lazy, streaming readers via `NextSeq()`; support indexed lookup by name
- `FastaWriter` / `FastqWriter`: writers with optional line wrapping (FASTA) and gzip output
- `SeqQual`: core type holding sequence, quality, name, strand, and position; supports `RevComp()` and `Sub()` extraction
- Memory-efficient chunked iteration via Go `iter.Seq`
Smith-Waterman based alignment with affine gap penalties and Oxford Nanopore-aware homopolymer discounting.
- `NewLocalAligner()`: Smith-Waterman local alignment (soft clipping)
- `NewGlobalAligner()`: Needleman-Wunsch end-to-end alignment
- `NewSemiGlobalAligner()`: full query aligned, free target end gaps
- `DnaAlignmentDefaults()` / `OntAlignmentDefaults()`: preset scoring parameters
- Configurable scoring matrix, gap penalties, clipping, and homopolymer discount via builder pattern
- `AlignBatch()`: parallel alignment with semaphore-controlled goroutine pool
- `CigarCondense()` / `CigarExpand()`: convert between run-length and per-base CIGAR formats
- `MSA()`: incremental consensus multiple sequence alignment returning an `MSAAlignment` with optional homopolymer compression and reference sequence handling
- `MSAAlignment`: result type with `Consensus()`, `RehydratedConsensus()`, `WriteClustal()`, `WriteFasta()`, `GappedSequences()` for library-level output
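
The conversion between run-length and per-base CIGAR strings mentioned above can be sketched as follows. This is a standalone illustration, not cgkit's implementation: `condense` and `expand` are hypothetical names, and the real `CigarCondense()` / `CigarExpand()` signatures may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// condense converts a per-base CIGAR ("MMMIIM") to run-length form ("3M2I1M").
func condense(expanded string) string {
	var b strings.Builder
	for i := 0; i < len(expanded); {
		// Advance j to the end of the current run of identical operations.
		j := i
		for j < len(expanded) && expanded[j] == expanded[i] {
			j++
		}
		b.WriteString(strconv.Itoa(j - i))
		b.WriteByte(expanded[i])
		i = j
	}
	return b.String()
}

// expand converts a run-length CIGAR ("3M2I1M") to per-base form ("MMMIIM").
// It returns an error if an operation has no length or a length has no operation.
func expand(cigar string) (string, error) {
	var b strings.Builder
	n, haveDigits := 0, false
	for i := 0; i < len(cigar); i++ {
		c := cigar[i]
		if c >= '0' && c <= '9' {
			n = n*10 + int(c-'0')
			haveDigits = true
			continue
		}
		if !haveDigits {
			return "", fmt.Errorf("operation %q at offset %d has no length", c, i)
		}
		b.WriteString(strings.Repeat(string(c), n))
		n, haveDigits = 0, false
	}
	if haveDigits {
		return "", fmt.Errorf("trailing length without an operation")
	}
	return b.String(), nil
}

func main() {
	fmt.Println(condense("MMMIIMDD")) // 3M2I1M2D
	s, err := expand("3M2I1M2D")
	fmt.Println(s, err) // MMMIIMDD <nil>
}
```

The per-base form is convenient when walking an alignment column by column; the run-length form is what SAM and BAM records store.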
Native reading and writing of SAM, BAM, and tabix-indexed files. Samtools is only required for CRAM.
Reading:
- `SamReader`: interface with `Next()`, `Header()`, `Query()`, `Close()`
- `NewSamReader()`: auto-detects format: `.bam` → native BAM reader, `.sam`/`.sam.gz` → native text reader, `.cram` → samtools
- `Query(ref, start, end)`: returns `iter.Seq2[*SamRecord, error]` for indexed region queries (BAM via BAI, CRAM via samtools)
- Flag, MAPQ, and tag filtering via `SamReaderOpts`
Writing:
- `SamWriter`: interface with `Write()`, `Close()`
- `NewSamWriter()`: native BAM output (unsorted or coordinate/name sorted with merge sort), samtools for CRAM
- Sorted BAM writer buffers ~768 MB, flushes to temp files, merge-sorts on `Close()`
Tabix:
- `TabixReader`: query tabix-indexed BGZF files (BED, VCF, GFF) with TBI or CSI index auto-detection
- `TabixWriter`: sorted BGZF output with optional `.tbi` index generation; presets for BED, VCF, GFF
- Both use `iter.Seq2` for query results with 0-based half-open coordinates
Index support:
- BAI, TBI, CSI index parsers with shared `Query()` interface
- `ParseRegion()`: converts samtools-style region strings (`chr1:1000-2000`) to 0-based half-open coordinates
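
The coordinate conversion that `ParseRegion()` performs can be sketched as below. This is a standalone approximation, not the library's code: `parseRegion` is a hypothetical helper, its return shape is an assumption, and it handles only the full `ref:start-end` form.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRegion converts a 1-based inclusive region string such as
// "chr1:1000-2000" into a reference name and 0-based half-open bounds.
func parseRegion(s string) (string, int, int, error) {
	// Split on the last ':' so reference names containing ':' still work.
	i := strings.LastIndex(s, ":")
	if i <= 0 {
		return "", 0, 0, fmt.Errorf("region %q: expected ref:start-end", s)
	}
	lo, hi, ok := strings.Cut(s[i+1:], "-")
	if !ok {
		return "", 0, 0, fmt.Errorf("region %q: expected ref:start-end", s)
	}
	// Strip thousands separators such as "1,000".
	first, err := strconv.Atoi(strings.ReplaceAll(lo, ",", ""))
	if err != nil {
		return "", 0, 0, fmt.Errorf("region %q: bad start: %w", s, err)
	}
	last, err := strconv.Atoi(strings.ReplaceAll(hi, ",", ""))
	if err != nil {
		return "", 0, 0, fmt.Errorf("region %q: bad end: %w", s, err)
	}
	if first < 1 || last < first {
		return "", 0, 0, fmt.Errorf("region %q: invalid range", s)
	}
	// 1-based inclusive [first, last] covers the same bases as
	// 0-based half-open [first-1, last).
	return s[:i], first - 1, last, nil
}

func main() {
	ref, start, end, err := parseRegion("chr1:1000-2000")
	fmt.Println(ref, start, end, err) // chr1 999 2000 <nil>
}
```

Only the start shifts by one; the end is unchanged because an inclusive 1-based end and an exclusive 0-based end name the same boundary.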
Core types:
- `SamRecord`: full SAM record with flag accessors (`IsUnmapped()`, `IsReverse()`, etc.) and typed tag access
- `SamHeader`: header manipulation including `@PG` line generation
- `TagFilter`: flexible tag-based filtering with comparison operators
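
Flag accessors and flag/MAPQ filtering come down to bit tests on the SAM `FLAG` field. The sketch below uses the bit values from the SAM specification; the `record` type and the `keep` function are hypothetical stand-ins and do not reflect the actual layout of `SamRecord` or `SamReaderOpts`.

```go
package main

import "fmt"

// Flag bits as defined by the SAM specification.
const (
	flagPaired        uint16 = 0x1
	flagUnmapped      uint16 = 0x4
	flagReverse       uint16 = 0x10
	flagSecondary     uint16 = 0x100
	flagDuplicate     uint16 = 0x400
	flagSupplementary uint16 = 0x800
)

// record is a hypothetical stand-in holding only the fields used here.
type record struct {
	Flag uint16
	MapQ uint8
}

func (r record) IsUnmapped() bool { return r.Flag&flagUnmapped != 0 }
func (r record) IsReverse() bool  { return r.Flag&flagReverse != 0 }

// keep reports whether r passes a filter: every bit in required must be
// set, no bit in excluded may be set, and MAPQ must be at least minMapQ.
func keep(r record, required, excluded uint16, minMapQ uint8) bool {
	return r.Flag&required == required &&
		r.Flag&excluded == 0 &&
		r.MapQ >= minMapQ
}

func main() {
	// 83 = paired, proper pair, reverse strand, first in pair.
	rec := record{Flag: 83, MapQ: 60}
	fmt.Println(rec.IsUnmapped(), rec.IsReverse())                // false true
	fmt.Println(keep(rec, flagPaired, flagUnmapped|flagSecondary, 20)) // true
}
```

The required/excluded pair mirrors the `-f` / `-F` options of `samtools view`, which is a common way to express flag filters.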
Low-level BGZF (Blocked GNU Zip Format) support used by BAM and tabix.
- `Reader` / `Writer`: streaming BGZF read/write with virtual offset tracking
- `IndexedReader`: random access with LRU block cache (default 64 blocks); supports virtual offset seeking and `.gzi` index for uncompressed offset seeking
- `NewBGZipFile()`: convenience constructor for file-backed BGZF output
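
A BGZF virtual offset packs two numbers into one 64-bit value: the file offset of the compressed block in the upper 48 bits, and the offset within that block's uncompressed data in the lower 16 bits. The sketch below shows the packing as laid out in the SAM/BAM specification; the function names are hypothetical and are not cgkit's API.

```go
package main

import "fmt"

// makeVirtualOffset packs the file offset of a compressed block and an
// offset into that block's uncompressed data into one virtual offset.
func makeVirtualOffset(blockStart uint64, within uint16) uint64 {
	return blockStart<<16 | uint64(within)
}

// splitVirtualOffset reverses makeVirtualOffset.
func splitVirtualOffset(v uint64) (blockStart uint64, within uint16) {
	return v >> 16, uint16(v & 0xFFFF)
}

func main() {
	v := makeVirtualOffset(123456, 789)
	block, within := splitVirtualOffset(v)
	fmt.Println(block, within) // 123456 789
}
```

Because the block offset occupies the high bits, comparing two virtual offsets as plain integers gives the same order as comparing positions in the uncompressed stream, which is what lets BAI and TBI indexes store them as simple sorted intervals.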
- `sequtils`: IUPAC ambiguity matching, reverse complement, homopolymer run analysis, 4-bit DNA encoding
- `utils`: `Semaphore` for concurrency control, `PositionTrackingReader`, float formatting
- `analysis/seq`: GC content calculation
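
Two of the utilities listed above, reverse complement and GC content, are simple enough to sketch in a few lines. The code below is a standalone illustration rather than the `sequtils` or `analysis/seq` implementation: the function names are hypothetical, IUPAC ambiguity codes are collapsed to `N` for brevity, and excluding non-ACGT symbols from the GC denominator is an assumption that may not match cgkit's behavior.

```go
package main

import "fmt"

// complement returns the complementary base, preserving case.
// Anything other than A/C/G/T is mapped to 'N' in this sketch.
func complement(b byte) byte {
	switch b {
	case 'A':
		return 'T'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	case 'T':
		return 'A'
	case 'a':
		return 't'
	case 'c':
		return 'g'
	case 'g':
		return 'c'
	case 't':
		return 'a'
	default:
		return 'N'
	}
}

// revComp returns the reverse complement of seq.
func revComp(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		out[len(seq)-1-i] = complement(seq[i])
	}
	return string(out)
}

// gcFraction returns the fraction of A/C/G/T bases that are G or C.
// Other symbols are skipped (an assumption), and an input with no
// A/C/G/T bases yields 0.
func gcFraction(seq string) float64 {
	gc, total := 0, 0
	for i := 0; i < len(seq); i++ {
		switch seq[i] {
		case 'G', 'C', 'g', 'c':
			gc++
			total++
		case 'A', 'T', 'a', 't':
			total++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(gc) / float64(total)
}

func main() {
	fmt.Println(revComp("AACG"))        // CGTT
	fmt.Println(gcFraction("GGCCAATT")) // 0.5
}
```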
Usage: cgkit [--profile=cpu.prof] <command>
| Command | Description |
|---|---|
| `fasta-wrap` | Reformat sequences to a specified line width (`-w`, default 70) |
| `fasta-gc` | Calculate GC content of sequences |

| Command | Description |
|---|---|
| `fastq-gc` | Calculate GC content of sequences |
| `fastq-tag` | Add a tag to the comment field of records |

| Command | Description |
|---|---|
| `seq-revcomp` | Reverse complement a sequence |
| `seq-pairwise` | Pairwise alignment with configurable scoring, gap penalties, and homopolymer discounts |
| `seq-msa` | Multiple sequence alignment via incremental consensus (CLUSTAL by default; `--fasta` or `--consensus` for alternates; `--hp-compress` collapses homopolymers and rehydrates the consensus; `--ref <name>` marks a reference sequence that is aligned last, displayed first, and used for HP tiebreaks) |

| Command | Description |
|---|---|
| `sam-export` | Export selected columns and tags as tab-delimited text |
| `sam-filter` | Filter reads by region, flags, MAPQ, or tags and write to a new file |
| `sam-toseq` | Convert reads to FASTA or FASTQ |

| Command | Description |
|---|---|
| `ont-tags` | Find and trim common ONT adapter/primer tags from FASTQ reads |
| `ont-umi-cluster` | Collapse similar UMIs in a coordinate-sorted BAM file |
| `ont-umi-lookup` | Match reads to UMI clusters from `ont-umi-cluster` output |